# Overfitting – the elephant in the Cartesian plane

Just because it’s fun and because I’m still not on the other side of my blog holiday, here’s a nice picture of an elephant from Burnham and Anderson (2002) who got it from Wei (1975).

It’s all in response to John von Neumann who said,

“with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

But that was more hyperbole than quantitative accuracy. Further investigations by Wei (1975) revealed that the 30-term elephant

“may not satisfy the third-grade art teacher, but would carry most chemical engineers into preliminary design.”
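The point behind the quip can be sketched in a few lines. This is only a toy illustration and not Wei's elephant: it assumes a made-up straight-line signal with a fixed alternating perturbation standing in for noise, and compares a two-parameter line with a ten-parameter polynomial forced through every point. All names and numbers here are hypothetical.

```python
# Toy overfitting demo (all data here are made up for illustration).

def fit_line(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def lagrange(xs, ys, x):
    """Evaluate the polynomial passing through every (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def mse(predicted, observed):
    """Mean squared error between two equally long sequences."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

def signal(x):
    """The assumed 'true' relationship."""
    return 2.0 * x + 1.0

# Training data: signal plus a deterministic +/-0.5 wobble standing in for noise.
train_x = [float(i) for i in range(10)]
train_y = [signal(x) + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_x)]
# Held-out points halfway between the training points, with no noise.
test_x = [i + 0.5 for i in range(9)]
test_y = [signal(x) for x in test_x]

intercept, slope = fit_line(train_x, train_y)
line_train = mse([intercept + slope * x for x in train_x], train_y)
line_test = mse([intercept + slope * x for x in test_x], test_y)
poly_train = mse([lagrange(train_x, train_y, x) for x in train_x], train_y)
poly_test = mse([lagrange(train_x, train_y, x) for x in test_x], test_y)

print(f"line: train MSE {line_train:.4f}, test MSE {line_test:.4f}")
print(f"poly: train MSE {poly_train:.4f}, test MSE {poly_test:.4f}")
```

The ten-parameter polynomial reproduces the training points exactly, wobble included, and pays for it between the points, while the line misses the wobble and stays close to the signal.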

Burnham K, Anderson D. 2002. Model selection and multimodel inference. Springer, USA.

Wei J. 1975. Least square fitting of an elephant. Chemtech 5: 128–129.

# Why parsimony?

One question is whether a simple model necessarily exists for a given biological question; another is whether that model is unique. Taking this one step further, given two models that are equal in all regards except that one is more complex, why should we favour the simpler model? This argument, that we should prefer simpler explanations, is Occam’s razor.

William of Ockham. This picture is attributed to the following source.

Here’s the definition of Occam’s razor from Wikipedia:

It is a principle urging one to select, among competing hypotheses, that which makes the fewest assumptions and thereby offers the simplest explanation of the effect.

Justifications for Occam’s razor

• Aesthetic: nature is simple and simple explanations are more likely to be true.
• Empirical: You want the signal; you don’t want the noise. A complex model will give you both, e.g. overfitting in statistics.
• Mathematical: hypotheses that have fewer adjustable parameters will automatically have an enhanced posterior probability because the predictions are sharper (Jeffreys & Berger, 1991)
• Practical: it is easier to understand simple models.
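The mathematical justification can be made concrete with a small worked example. The sketch below is not taken from Jeffreys and Berger; it assumes a hypothetical coin-flipping experiment and compares a model with no adjustable parameter (a fair coin) against a model with one adjustable parameter (an unknown bias with a uniform prior). The function names are made up for illustration.

```python
from math import comb

def evidence_fair(heads, flips):
    """P(data | fair coin): the model has no adjustable parameter."""
    return comb(flips, heads) * 0.5 ** flips

def evidence_biased(heads, flips):
    """P(data | coin with unknown bias p, uniform prior on [0, 1]).

    Integrating comb(flips, heads) * p**heads * (1 - p)**(flips - heads)
    over p from 0 to 1 gives exactly 1 / (flips + 1), whatever `heads` is:
    the flexible model spreads its predictions evenly over all outcomes.
    """
    return 1.0 / (flips + 1)

def bayes_factor(heads, flips):
    """Evidence for the fair coin divided by evidence for the biased coin."""
    return evidence_fair(heads, flips) / evidence_biased(heads, flips)

for heads in (5, 9):
    print(f"{heads} heads in 10 flips: Bayes factor (fair / biased) = "
          f"{bayes_factor(heads, 10):.3f}")
```

When the data are compatible with both models (5 heads in 10 flips), the Bayes factor exceeds 1 because the parameter-free model makes the sharper prediction. When the data are lopsided (9 heads in 10 flips), the Bayes factor falls below 1 and the flexible model wins, so the preference for simplicity is not unconditional.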

Alternatives to Occam’s razor

• Popper (1992): For Popper it can all be cast in the light of falsifiability. We prefer simpler theories “because their empirical content is greater” and because they are testable.
• Elliott Sober: simplicity considerations do not count unless they reflect something more fundamental.**

And yet my initial reaction to the definition of Occam’s razor was that it sounded a bit strange: simple explanations and few assumptions? Yikes, I can give you your simple explanation, but it’s going to take a lot of assumptions to get there. I think my confusion could be due to a difference in bookkeeping (and the phrasing ‘simple explanation of the effect’). In the Occam’s razor definition, you score only assumptions that contribute to the explanation. In biology, if the true explanation consists of n things-that-matter, the theoretician will say that the observation can be reproduced by considering only k < n of those things. Biologists, however, are used to scoring the number of assumptions as the number of things that are suspected to matter but are neglected, i.e. n − k. This difference would seem to suggest that, although in biology we do value simplicity, we also value explanations that incorporate known contributing factors over explanations that ignore them. These values are reflected in Elliott Sober’s view, described above under Alternatives to Occam’s razor.

However, even given that caveat, I think we still often prefer simple models in biology. Why? Here’s Ben Bolker (p7)*** with some insight:

By making more assumptions, (mechanistic models) allow you to extract more information from your data – with the risk of making the wrong assumptions.

That does kind of sum it up from the data analysis perspective: simple models make a lot of assumptions, but at the end of it you can conclude something concrete. Complex models still make assumptions, but they are a less restrictive type of assumption (i.e., an assumption about how a factor is included rather than an assumption to ignore it). All this flexibility in complex models means that many different parameter combinations can lead to the same outcome: inference is challenging, and parameters are likely to be unidentifiable. Given Wikipedia’s list of different justifications of Occam’s razor this seems to be an example of ‘using the mathematical justification to practical ends’. That is to say, this argument doesn’t seem to fit well into the list of justifications, but elements of the mathematical and the practical justifications are represented. Or perhaps it fits with Popper’s alternative view?
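A minimal sketch of what unidentifiable parameters look like, assuming a hypothetical exponential growth model with separate birth and death rates (this example is mine, not Bolker’s). Only the difference between the two rates affects the trajectory, so population counts alone cannot separate them.

```python
import math

def population(t, n0, birth, death):
    """Exponential growth: N(t) = N0 * exp((birth - death) * t)."""
    return n0 * math.exp((birth - death) * t)

times = [0, 1, 2, 3, 4]

# Two hypothetical parameter sets with the same net growth rate of 0.5.
slow_turnover = [population(t, 100.0, birth=0.75, death=0.25) for t in times]
fast_turnover = [population(t, 100.0, birth=1.5, death=1.0) for t in times]

for t, slow, fast in zip(times, slow_turnover, fast_turnover):
    print(f"t={t}: slow turnover {slow:.2f}, fast turnover {fast:.2f}")

# The trajectories coincide, so no amount of count data can tell them apart.
indistinguishable = all(
    math.isclose(slow, fast) for slow, fast in zip(slow_turnover, fast_turnover)
)
print("indistinguishable:", indistinguishable)
```

A simpler model with a single net growth rate gives up the distinction between births and deaths, but its one parameter can actually be estimated from the counts.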

For the theoretical ecologist, another reason that parsimony is often favoured is certainly the practical justification: because simple models are easier to understand.

What do you think? Is parsimony important in biology? And why?

References

Jeffreys and Berger (1991) Sharpening Ockham’s Razor on a Bayesian Strop. [pdf]

Alternatively, if that isn’t satisfying, this might do the trick:

Quine, W (1966) On simple theories in a complex world. In The Ways of Paradox and Other Essays. Harvard University Press.****

————–

*okay, so maybe the actual highlight for me was learning a new expression. The expression is ‘turtles all the way down’ and the best way to explain it is by using it in a sentence. Here goes: sometimes people say ‘yes, but that’s not really a mechanistic model because you could take this small part of it and make that more mechanistic, and then you could take parts of that and make those more mechanistic.’ And to that, I would say ‘yes, but why bother? It’s just going to be turtles all the way down‘.

**fundamental = mechanistic, i.e. biological underpinning. This is a quote from Wikipedia and I need to chase down the exact reference for the statement. I have Elliott Sober (2000) Philosophy of Biology but he doesn’t seem to say anything quite this definitive.

***Ben suggests the references: Levins (1966) The strategy of model building in population biology; Orzack and Sober (1993) A critical assessment of Levins’s The strategy of model building in population biology; and Levins (1993) A response to Orzack and Sober: Formal analysis and the fluidity of science. [I’ll read them and let you know.]

****I haven’t read either, I just list the references in case anyone wants to follow up.

# Making the list, checking it twice

The table below lists the goals of mathematical modelling as described in three books and one journal article with respect to my list.* For each reference, when an item from my list is mentioned, I provide the page number, section, or chapter where the mention is made.

| Goal | Levin | Caswell | Hilborn | Haefner | Otto |
|---|---|---|---|---|---|
| 1. Quantitative prediction | p424@ | p34 | Ch 4 | | 1.3.3 |
| 2. Qualitative prediction | p424 | p34 | Ch 3 | | |
| 3. Bridge between different scales | | | | | |
| 4. Parameter estimation | | | | 1.2 (2.) | 1.3.2; 1.3.4 |
| 5. Clarify logic behind a relationship | p424 | p34 | | | 1.3.1 |
| 6. Test hypothetical scenarios | p424 | p34 | | 1.2 (4.) | 1.3.3 |
| 7. Motivate test/experiment | p424 | (6) p38 | | | 1.3.1 |
| 8. Disentangle multiple causation | | | | | |
| 9. Make an idea precise, integrate thinking | p424 | (4) p38 | | 1.2 (3.) | |
| 10. Inform data needs | | | | 1.2 (1.) | |
| 11. Highlight sensitivities to parameters or assumptions | | (3) p38 | | | |
| 12. Determine the necessary requirements for a given relationship | | (2) p38 | | | |
| 13. Characterize all theoretically possible outcomes | | (1) p38 | | | |
| 14. Identify common elements from seemingly disparate situations | p424 | (2) p38 | | | |
| 15. Detect hidden symmetries and patterns | p424 | | | | |

References

Levin (1980), Mathematics, ecology and ornithology. Auk 74: 422-425.**

Caswell (1988), Theory and models in ecology: a different perspective. Ecological Modelling 43: 33-44.***

Hilborn and Mangel (1997), The Ecological Detective. Princeton Monographs.

Haefner (1996), Modeling biological systems: principles and applications. Chapman and Hall.

Otto and Day (2006), A biologist’s guide to mathematical modeling in ecology and evolution. Princeton University Press.

Footnotes

*Please feel welcome to suggest references to be added or to disagree with the placement of items in the table.

References suggested by **lowendtheory and ***Pablo Almaraz during comments on the ‘Crowdsourcing’ post at the Oikos blog.

@Although Levin advocates for the derivation of qualitative models as these rest on firmer axioms.

# Geometry for the Selfish Herd (Hamilton, 1971)

No apology, therefore, need be made even for the rather ridiculous behavior that tends to arise in the later stages of the model process, in which frogs supposedly fly right around the circular rim… The model gives a hint which I wish to develop: …the selfish avoidance of a predator can lead to aggregation.

— W.D. Hamilton (1971)

Sheep cyclone got me thinking about Hamilton’s selfish herd (pdf). It’s been a while since I’ve commented on model derivation with reference to the published literature, and so it seemed like a good idea to re-read Geometry for the selfish herd (1971) with the goal of discussing Hamilton’s decision-making process regarding his model assumptions.

To quickly summarize, the model from Section 1 of the Selfish Herd involves a water snake that feeds once a day on frogs. If they stay in the pond, the frogs will certainly be eaten and so they exit the pond and arrange themselves around the rim. The snake will then strike at a random location and eat the nearest frog. A frog’s predation risk is described by its ‘domain of danger’, which is half the distance to the nearest frog on either side (see Figure). Lone frogs have the highest risk of predation, which leads to the formation of frog aggregations (like the one in the lower left corner of the Figure). The model from Section 3 of the Selfish Herd presents the same problem, but in two-dimensional space, and so now the domains of danger are polygons. In relation to Section 3, I think Hamilton foreshadows the sheep cyclone because he considers the case where the predator is concealed in the centre of a herd before pouncing, as well as discussing (the probably more common) predation on the margins (when the predator starts on the outside).
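The one-dimensional domain of danger described above can be sketched as follows. This is my own minimal reading of the summary, not code from Hamilton’s paper: the rim is assumed to be a circle of circumference 1, frog positions are hypothetical, and each frog’s domain is taken as half the gap to its neighbour on one side plus half the gap to its neighbour on the other.

```python
def domains_of_danger(positions, circumference=1.0):
    """Return each frog's domain of danger on a circular rim.

    positions: distinct locations in [0, circumference).
    The result is in the same order as `positions`.
    """
    n = len(positions)
    if n == 1:
        # A single frog is the nearest frog wherever the snake strikes.
        return [circumference]
    # Indices of the frogs ordered by their position around the rim.
    order = sorted(range(n), key=lambda i: positions[i])
    domains = [0.0] * n
    for rank, i in enumerate(order):
        before = positions[order[rank - 1]]        # neighbour on one side
        after = positions[order[(rank + 1) % n]]   # neighbour on the other side
        gap_before = (positions[i] - before) % circumference
        gap_after = (after - positions[i]) % circumference
        domains[i] = 0.5 * (gap_before + gap_after)
    return domains

# Hypothetical arrangement: one lone frog at 0 and a tight group of three.
frogs = [0.0, 0.5, 0.5625, 0.625]
for position, domain in zip(frogs, domains_of_danger(frogs)):
    print(f"frog at {position:.4f}: domain of danger {domain:.5f}")
```

In this arrangement the lone frog has the largest domain and the frog in the middle of the group has the smallest, which is the asymmetry that rewards moving into a gap between neighbours.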

As you can see from the quote above, Hamilton makes no apologies for unrealistic qualities because his model gives him some helpful insight. This insight is that aggregations could arise from a selfish desire to diffuse predation risk. In terms of the model derivation, a helpful construct is the domain of danger, whereby minimizing the domain of danger likely corresponds to minimizing predation risk, assuming the predator only takes one prey and starts searching in a random location.
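The link between domain of danger and predation risk can be checked with a small simulation. This is a sketch under the stated assumptions only (one strike at a uniformly random point on a rim of circumference 1, and the nearest frog is always the one eaten); the frog positions, seed, and function names are hypothetical.

```python
import random

def circular_distance(a, b, circumference=1.0):
    """Shortest distance between two points measured around the rim."""
    d = abs(a - b) % circumference
    return min(d, circumference - d)

def strike_frequencies(positions, n_strikes=100_000, seed=1, circumference=1.0):
    """Fraction of random strikes in which each frog is the nearest one."""
    rng = random.Random(seed)
    counts = [0] * len(positions)
    for _ in range(n_strikes):
        strike = rng.random() * circumference
        victim = min(
            range(len(positions)),
            key=lambda i: circular_distance(strike, positions[i], circumference),
        )
        counts[victim] += 1
    return [count / n_strikes for count in counts]

# Hypothetical arrangement: one lone frog at 0 and a tight group of three.
frogs = [0.0, 0.5, 0.5625, 0.625]
for position, risk in zip(frogs, strike_frequencies(frogs)):
    print(f"frog at {position:.4f}: eaten in {risk:.3f} of strikes")
```

For these positions the half-gap domains work out to 0.4375, 0.28125, 0.0625 and 0.21875 of the rim, and the simulated frequencies should land close to those values. Change either assumption (for example, a snake that prefers a particular stretch of the rim) and the equivalence no longer has to hold.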

Overall, the one-dimensional scenario seems contrived, and I’m not sure I understand how the insight from one dimension carries over into two dimensions. In two dimensions, what incentive is there for the ungulates that start far from the lion to let the ungulates that start near the lion catch up and form a herd? The model in Section 1 is what I would call a ‘toy model’: it acts as a proof of concept, but is so simple that its value is more as ‘something to play around with’ than as a legitimate instrument. I wonder about the relevance of edge effects: in Section 1, the model is not just one-dimensional, but the frogs are limited to the rim of the pond, which is of finite length. The more realistic two-dimensional example of ungulates on the plain should consider, I think, a near-infinite expanse. If the one-dimensional problem were instead ‘frogs on a river bank’, would this all play out the same? Would frogs on the river bank aggregate?

Pre-1970s computing power probably made it quite difficult for Hamilton to investigate this question to the level that we could today, but nonetheless I’m going to put this model in the ‘Too Simple’ category. For me, this paper never reaches the minimum level of complexity that I need it to, which would be: (a) simultaneous movement of prey; (b) in response to a pursuing predator; and (c) in an expansive two-dimensional space. Aside from this paper, Hamilton sounds like he was an entertaining fellow whose other work made highly substantive contributions.