Issue #60 // Compression As Understanding
The Tension Between Mechanistic Understanding/Clinical Interpretability and Predictive Biology
The most useful computational models in biology aren’t the ones with the most parameters; they’re the ones that achieve the most compression. When you can predict gene expression across dozens of conditions using only a handful of latent factors, or explain variation in protein abundance with a small set of regulatory relationships, you’ve found something closer to biological "ground truth" than any very high-dimensional model that simply memorizes its training data and spits out a prediction, even if that model is more accurate in a specific testing context.
Beyond their use as data visualization tools, dimensionality reduction techniques like PCA, t-SNE, and UMAP can be used to probe the underlying structure of biological systems. I wrote about this previously in Issue #36 // Machine Learning: The Native Language of Biology:
"In a neural network, high-dimensional input data is compressed into latent representations— abstract patterns that capture essential features while discarding noise. Similarly, a transcription factor distills complex environmental information into a binary state (active or inactive) that the cell can use to make decisions. This concept of latent spaces offers another window into why machine learning aligns so well with biology. In cell biology, high-dimensional data like gene expression profiles or microscopy images can be projected into lower-dimensional spaces that capture meaningful biological variation. Each dimension in this latent space ideally corresponds to some biological process or state—cell cycle phase, differentiation stage, or stress response."
If the expression of 20,000 genes can be projected onto 50 principal components that capture most of the variance, this tells us the transcriptome isn’t actually operating in 20,000-dimensional space. It’s operating in something far lower-dimensional, with the vast majority of genes acting as downstream readouts of a smaller set of regulatory "programs".
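As a rough illustration of that claim, here is a minimal sketch using synthetic data rather than a real expression matrix. It assumes a hypothetical dataset in which 2,000 "genes" are driven by only 5 latent programs plus noise (all sizes and noise levels are made up for the example), then asks how many principal components are needed to capture 90% of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, n_programs = 200, 2000, 5  # hypothetical sizes

# Each sample's expression is driven by a few latent "program" activities.
activity = rng.normal(size=(n_samples, n_programs))
loadings = rng.normal(size=(n_programs, n_genes))
noise = 0.5 * rng.normal(size=(n_samples, n_genes))
expression = activity @ loadings + noise

# PCA via SVD of the mean-centred matrix.
centred = expression - expression.mean(axis=0)
singular_values = np.linalg.svd(centred, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest number of components capturing at least 90% of the variance.
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)
print(f"{n_needed} of {n_genes} dimensions capture 90% of the variance")
```

Because the data were generated from a handful of programs, only a handful of components are needed. Real expression data are messier, and the number of components you recover depends on normalization and on the variance threshold you choose.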
The same idea can be applied to patient stratification. Take breast cancer, for example. We know every patient’s cancer is unique, with its own gene mutations and amplifications, alterations to transcriptional programs, behaviors, and responses to treatment. This idea is encapsulated by the Anna Karenina principle in oncology, which states something to the effect that all healthy tissues are alike, whereas all diseased tissues are diseased in their own way. Yet if we can compress thousands of clinical and molecular features into a small number of subtypes that meaningfully predict treatment response (such as ER/HER2 status, PAM50 subtypes, MammaPrint risk profiles, etc.), there is an argument to be made that, despite enormous molecular complexity, the relevant biological variation between tumors has low intrinsic dimensionality [1].
However, I would argue that compression is most useful when it leads to increased understanding of the underlying biological system you’re working on. An autoencoder can compress gene expression data with very little reconstruction error while learning latent representations that cannot be interpreted biologically or mechanistically. The representations that matter are the ones that align with biological processes we can observe, measure, or manipulate, because those are the dimensions along which we can actually intervene.
This is why interpretable models are often more useful than black boxes in biological and clinical applications, even when they’re less accurate. A decision tree that says "if local tissue hypoxia and immune infiltration, then poor prognosis" is testable biology. A neural network that achieves 2% better accuracy but can’t explain its reasoning in biological terms may just be performing complex pattern matching. In this way, the goal of computational modeling isn’t prediction for its own sake. It’s to build models that compress biological complexity and heterogeneity in ways that suggest new experiments, reveal potential mechanisms of action, and ultimately show us levers that we can pull to intervene in disease.
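To make the decision tree example concrete, here is a small sketch of what such a rule looks like when written down. The `Tumor` fields, the 0-to-1 scores, and the 0.5 cutoffs are all hypothetical placeholders for illustration, not validated clinical thresholds.

```python
from dataclasses import dataclass


@dataclass
class Tumor:
    hypoxia_score: float        # hypothetical 0-1 score
    immune_infiltration: float  # hypothetical 0-1 score


def predict_prognosis(tumor, hypoxia_cutoff=0.5, infiltration_cutoff=0.5):
    """Return (prediction, reason) so every call carries its own explanation."""
    hypoxic = tumor.hypoxia_score >= hypoxia_cutoff
    infiltrated = tumor.immune_infiltration >= infiltration_cutoff
    if hypoxic and infiltrated:
        return "poor prognosis", "hypoxia and immune infiltration both present"
    # The rule only flags one combination; everything else is left unflagged.
    return "not flagged", "rule conditions not met"


prediction, reason = predict_prognosis(Tumor(0.8, 0.7))
print(prediction, "because", reason)
```

The point is not the rule itself but its form: each branch names a measurable quantity and a cutoff, so the prediction can be challenged by perturbing hypoxia or infiltration experimentally.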
George Church recently gave a great interview on this topic for the Lifespan Research Institute, which you can find here. I’ve reposted a short excerpt below:
About that: it’s always interpretability versus the model’s power. Where are you in this debate? Would you prefer a weaker but more interpretable AI or a stronger but less interpretable one?
GC: I lean on the interpretability side. It’s not an either-or, but… we’re in science. Few engineers are willing to just pull a rabbit out of a hat, just a black box. Scientists and engineers, by and large, want to know the mechanism. The FDA likes to know mechanisms. Typically, the autocatalytic loop where you learn something and then you invent something is better if it’s mechanistically grounded. So, I lean pretty heavily in the direction of interpretability, explainability, transparency, et cetera, and also it’s safer.
I just honestly think that we will soon be faced with this dilemma, where we will have to choose between the power of the model to do things and its actual interpretability, but maybe we’re not there yet.
GC: If you look at the human scientist experience, the most powerful sciences are the ones that are better articulated mechanistically on a solid foundation rather than black boxes. The black boxes tend to include artifacts, dead ends. Most of the progress in science and engineering has been part of community efforts with strong mechanistic underpinnings.
These ideas stand opposed to the core thesis of predictive biology, which argues that predicting the outcome of an unknown experiment is equivalent to understanding a system. While beyond the scope of this article, the tension between "mechanism as understanding" and "prediction as understanding" is interesting to reflect on. As a biologist, I tend to lean towards the former.
This tension has played out in my own research a bit, where I’ve worked with both Cypher query-based graph traversal methods and graph neural network (GNN)-based approaches to predict phenotypes from molecular perturbation data. The former is more interpretable, allows for novel pathway discovery, and provides more causal information, all of which make it more trustworthy from a clinical standpoint. Yet the GNN-based approaches often make predictions that the traversal-based ones miss; their utility is hard to validate, but they are interesting nonetheless. Of course, there are ways to take the best of both worlds and build a hybrid system, but at the end of the day we still need to decide how much we’re willing to trust predictions we don’t fully understand, and how mechanistic and interpretable our models need to be in order to be useful.
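To show why traversal-based predictions are easier to trust, here is a toy sketch in plain Python. It is a stand-in for a Cypher query, not the actual method described above, and the graph, node names, and relation types are all invented for the example. The search returns not just a yes/no answer but the chain of relationships that links a perturbation to a phenotype.

```python
from collections import deque

# Hypothetical toy knowledge graph: node -> list of (relation, neighbour).
GRAPH = {
    "drug_X": [("inhibits", "kinase_A")],
    "kinase_A": [("phosphorylates", "tf_B")],
    "tf_B": [("regulates", "gene_C"), ("regulates", "gene_D")],
    "gene_C": [("associated_with", "phenotype_P")],
    "gene_D": [],
    "phenotype_P": [],
}


def explain_path(graph, source, target):
    """Breadth-first search returning the shortest chain of
    (node, relation, neighbour) steps, or None if no path exists."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None  # no mechanistic path found


path = explain_path(GRAPH, "drug_X", "phenotype_P")
for node, relation, neighbour in path:
    print(f"{node} -[{relation}]-> {neighbour}")
```

Every step in the returned path is an edge someone can look up or test at the bench. A GNN trained on the same graph might score the drug-phenotype pair highly without exposing any such chain, which is the trade-off in question.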
[1] The big caveat here is that the compressed subtypes have to predict treatment response better than higher-dimensional alternatives in order to be considered meaningful. As I’ve written about previously, clustering techniques will always "work" in that they will produce partitions, or groupings, in data whether or not those partitions reflect biological reality. It’s up to us to determine if the boundaries between clusters reflect genuine biological discontinuities or arbitrary algorithmic choices, which is why models with mechanistic explainability are so valuable.
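The point about clustering always "working" is easy to demonstrate. The sketch below runs a bare-bones k-means (plain Lloyd's algorithm, written out here for illustration; the data and parameters are hypothetical) on points drawn uniformly at random, which by construction contain no real groups. It still returns exactly as many populated "subtypes" as you ask for.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    # Initialise centroids at k distinct, randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels


# Structureless data: 300 samples drawn uniformly, with no true groups.
X = np.random.default_rng(42).uniform(size=(300, 2))

for k in (2, 3, 4):
    labels = kmeans(X, k)
    sizes = np.bincount(labels, minlength=k)
    print(f"k={k}: cluster sizes {sizes.tolist()}")
```

The algorithm never reports "there is nothing here". Deciding whether the partitions mean anything requires something external to the clustering: held-out prediction of treatment response, or a mechanism that explains why the boundary sits where it does.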



