Friday, December 19, 2014

Survivorship bias and genetics

I was a mathematics major as an undergraduate.  However, neither then nor since have I been anything that one could call a mathematician.  At least, I hope I learned something about trying to think logically about life even if I never do equations.  But this interest led me to read a new book I was told of, called How Not to Be Wrong: The Power of Mathematical Thinking, by Jordan Ellenberg (2014, NY, Penguin Press).

This is a popular rather than technical book, but it shows in interesting and serious ways how mathematical thinking can lead to improved understanding of the real world.  I think it has relevance to an important area in current evolutionary and biomedical or agricultural genetics.  So I thought I'd write a post about it.

Survivorship bias
Ellenberg begins his book with an illustration of how abstract logical thinking can solve important real-world problems in subtle ways.  In WWII a mathematics research group was asked by the Army to help decide where to put armor plating on fighter aircraft.  The planes were returning to base with scattered bullet holes from enemy fire and the idea was to put some protective plating where it would do the most good without adding cumbersome mileage-eating weight.  One of the mathematicians suggested putting the plating where the bullet holes weren't.  This seemed strange until he explained his reasoning: the bullet holes that were observed hadn't done much damage, while bullets hitting elsewhere had brought the plane down, so those hits were never observed because the plane never returned to base.  The engine compartment was the case in point: a shot to the engine was fatal to the aircraft, but to the wings and body, much less so.
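
To make the logic concrete, here is a minimal sketch in Python.  The hit shares and lethality values are numbers I have invented purely for illustration; they are not from Ellenberg's book or from any wartime data.  The sketch compares where bullets actually land with where holes are seen on the planes that make it back.

```python
import random

# Hypothetical values for illustration only: the share of hits each section
# receives, and the chance that a single hit there brings the plane down.
HIT_SHARE = {"engine": 0.2, "fuselage": 0.4, "wings": 0.4}
LETHALITY = {"engine": 0.6, "fuselage": 0.1, "wings": 0.05}


def simulate(n_planes=20000, hits_per_plane=3, seed=1):
    """Count hits per section on all planes and on returning planes only."""
    rng = random.Random(seed)
    sections = list(HIT_SHARE)
    weights = [HIT_SHARE[s] for s in sections]
    all_hits = dict.fromkeys(sections, 0)
    seen_hits = dict.fromkeys(sections, 0)
    for _ in range(n_planes):
        hits = rng.choices(sections, weights=weights, k=hits_per_plane)
        # The plane returns only if every one of its hits is non-lethal.
        survived = all(rng.random() >= LETHALITY[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                seen_hits[s] += 1
    return all_hits, seen_hits


def shares(counts):
    """Convert raw counts to proportions of the total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


all_hits, seen_hits = simulate()
true_share = shares(all_hits)          # what the enemy actually hit
observed_share = shares(seen_hits)     # what the ground crew gets to see

for s in HIT_SHARE:
    print(f"{s:9s} all hits: {true_share[s]:.2f}   "
          f"returning planes: {observed_share[s]:.2f}")
```

Under these assumed numbers, engine hits are under-represented on the returning planes relative to how often engines are really hit, which is the pattern the mathematician inferred from the holes he could see.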

This is a case of survivorship bias.  It can apply widely, and evolution and genetic causation provide instances where it seems likely to be a useful principle.  As geneticists we ask, what are the genes whose variation causes variation in adaptive or biomedically interesting outcomes?  This is what genome mapping in its various forms is intended to identify.

Ironically, it seems, when we do experiments involving development or testing of genetic mechanisms by, say, knocking out a gene, or when we observe the major gene-usage switches that occur when some part of an embryo's body is forming, we can identify specific genes that seem to be very important.

Several pieces of evidence can suggest they are important.  One is the finding that the same gene is used in similar roles in very distantly related species (often, even, between humans and flies or even more distant species). Its usage has been conserved.  Secondly, there is usually far less variation within or between species in such genes than in what we believe to be non-functional or marginally functional parts of genomes. This seems to suggest that variation hasn't been tolerated by natural selection.  Thirdly, many congenital diseases in plants and animals including humans have proven to be due to the effects of variants, often newly arisen mutations, in a specific gene.  Most cases of diseases like Cystic Fibrosis, Phenylketonuria, Muscular Dystrophy, or Tay-Sachs Disease are of this sort.  Some congenital traits like, say, eye or skin color, are also due to inheriting specific variants in at least relatively few genes.

Such findings at least indirectly fueled the fervor for mapping every trait one can define, with grand promises of discovering the genes 'for' the trait.  Conscientious investigators justified expensive mapping efforts by showing that their trait of interest had substantial heritability, for example, because trait-values were to a substantial extent correlated among close relatives in predicted patterns.  However, for most traits like diabetes, cancer, heart disease, or behavioral characteristics, findings of specific genes with major effects are few and far between.

Despite a welter of PR spin to the contrary, instead of dramatic findings of the expected (and promised) sort, what was found was that the traits were affected by variation in tens, hundreds, or even thousands of different parts of the genome. Even taking all these together, they typically only accounted for a fraction--usually a small fraction--of the estimated heritability.

What is this 'missing heritability'?

Evolutionary survivorship bias
A central theoretical idea is that a fundamental genomic criterion for showing biological function is sequence conservation.  Most evolution is purifying: what has been put together over billions of years is risky to change.  So most mutations in clearly functional areas of DNA are either neutral or deleterious.  As a result, more variation accumulates in non- or weakly functional DNA than in important genes.

What that means is that the variation we see today misses variants that arose in the past but were removed by selection, and hence is not a representative sample of all the variation that arises.  The idea can be that most variation in genomes is of major importance.  As a result, the tendency is to assume that non-conserved areas of the genome are non-functional.  This may be true, but it may be that our belief that conservation equals function is a corollary of a belief in strong Darwinian natural selection in molding traits.  In fact, most genomic variation is not of the highly conserved sort, but our analysis and explanation of functional genomics is biased by our predilection for ignoring less-conserved variation.

This can be seen as a kind of survivorship bias in that we assume that variation in non-conserved genome areas just doesn't survive for very long--isn't conserved because it has no function.  That's a kind of circular reasoning and has been, for example, highly contentious in the interpretation of the ENCODE project's objective to identify all causal elements in the genome, and in questions about whether selectively neutral variation exists at all. The same conceptual bias leads to reconstructions of evolutionary adaptive history that center on the conserved genes as if they were the genes that were involved.  Finally, important genes that were involved in a trait's evolution to its current state may no longer be involved, and hence not be considered because their role did not survive to be identified today.

Biomedical survivorship bias
The same sort of bias in ascertaining the spectrum of causal variation exists on the shorter life-time scale of biomedical genetics.   There is a big discrepancy between, on the one hand, the clearly key role of genes identified in experimental and developmental genetics and the deeply conserved nature of those genes, and, on the other, the general lack of 'hits' in those genes when genomewide mapping is done on traits those genes affect.

How can a gene be central to the development of the basis of a trait, and yet not be found in mapping to identify variation that causes failures of the trait?  Indeed, the basic finding of GWAS and most other mapping approaches is that the tens or hundreds or thousands of genome 'hits' have individually trivial effects.

The answer may lie in survivorship bias.  Like the lethality of bullets to the engine of a fighter, most variation in the main genes, those whose sequence is more highly conserved, is lethal to the embryo or manifest in pathology so clear that it never is the subject of case-control or other sorts of Big Data mapping.  In other words, genome mapping may be systematically constrained to find small effects!  That's exactly the opposite of what's been promised, and the reason is that the promises were, psychologically or strategically, based on extrapolation of the findings of strong, single-gene effects causing severe pediatric disease--a legacy of Mendel's carefully chosen two-state traits.
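
The argument above can be illustrated with a toy simulation, again in Python.  Everything in it is an assumption made for the sake of the sketch: the exponential distribution of effect sizes, the `purge_strength` parameter, and the rule that a variant's chance of escaping lethality or obvious pathology falls off exponentially with its effect.  None of this is estimated from real data.

```python
import math
import random
from statistics import mean


def simulate_variants(n=100000, purge_strength=4.0, seed=2):
    """Return effect sizes of all new variants and of those left to be mapped."""
    rng = random.Random(seed)
    arising, surviving = [], []
    for _ in range(n):
        # Effect size of a newly arisen variant, in arbitrary units (assumed).
        effect = rng.expovariate(1.0)
        arising.append(effect)
        # Assumed rule: the bigger the effect, the less likely the variant
        # persists in the pool of people who end up in a mapping study.
        if rng.random() < math.exp(-purge_strength * effect):
            surviving.append(effect)
    return arising, surviving


def frac_large(effects, threshold=1.0):
    """Fraction of variants whose effect exceeds the threshold."""
    return sum(e > threshold for e in effects) / len(effects)


arising, surviving = simulate_variants()

print(f"mean effect, all new variants:     {mean(arising):.2f}")
print(f"mean effect, variants left to map: {mean(surviving):.2f}")
print(f"large-effect share, all new:       {frac_large(arising):.3f}")
print(f"large-effect share, left to map:   {frac_large(surviving):.3f}")
```

With these made-up settings, the variants that remain available to a mapping study are dominated by small effects even though large-effect variants arise routinely.  It is only a cartoon, but it shows how the sample can be skewed before any study begins.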

To the extent this is a correct understanding, genomewide mapping as it's now being done is, from an evolutionary genomic perspective, necessarily rainbow-chasing.  Indeed, a possibility is that most adaptive evolution is itself also due to the effects of minor variants, not major ones.  Once the constraining interaction of the major genetic factors is in place, mostly what can nudge organisms in this direction or that, whether adaptively or in relation to complex, non-congenital disease, is based on assembled effects of individually very minor variants.  In turn, that could be why slow gradualism was so obviously the way evolution worked to Darwin, and why it generally still seems that way today.

Survivorship bias is a kind of misunderstanding of statistics and sampling that careful reasoning can illuminate.  It is so easy to collect biased samples, and so hard to do otherwise, and consequently so easy to make convenient but erroneous inferences.  Science is a complex business and it's an unending challenge to do it right--even to know when we are doing it right!
