Twice as a student my professors off-handedly remarked that the parameterization of probabilistic models for real world situations lacked a sound philosophical basis. The first time I heard it, I figured if I ignored it maybe it would go away. Or perhaps I had misheard. The second time it came up, I made a mental note that I should revisit this at a later date. Let’s do this now.

The question is how we should interpret a probability. For example, if I want to estimate the probability that a coin will land heads on a single toss, how should I construct the experiment? My professors had said that there was no non-circular real world interpretation of what a probability is. At the time, this bothered me because I think of distributions like the Binomial distribution as the simplest kind of mathematical model: the models with the best predictive ability and the most reasonable assumptions. Models in mathematical biology, on the other hand, are usually quite intricate, with assumptions that are a lot less tractable. My thinking was that if it was impossible to estimate the probability that a coin lands heads on solid philosophical grounds, then there was no hope for me in trying to estimate parameters for mathematical models in biology.

Upon further investigation, I’m now not so sure. Below I summarize Elliott Sober’s discussion of some of the different interpretations of probabilities (pp. 61–70).

**1. The relative frequency interpretation.** A probability can be interpreted in terms of how often the event happens within a population of events, i.e., a coin that has a 0.5 probability of landing heads on a single toss will yield 50 heads on 100 tosses.

My view: This interpretation is not good because it’s not precise enough: a fair coin might very well not yield 50 heads on 100 tosses.
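That imprecision can be made exact: the chance that a fair coin yields *exactly* 50 heads in 100 tosses follows from the Binomial pmf, and it is only about 8%. A quick sketch in Python (just the standard Binomial calculation, nothing beyond what the example above assumes):

```python
from math import comb

# Exact Binomial pmf: P(X = k heads in n tosses of a p-coin)
def prob_exactly(k, n=100, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

p50 = prob_exactly(50)
print(f"P(exactly 50 heads in 100 tosses) = {p50:.4f}")
```

So a fair coin fails to yield 50 heads on 100 tosses more than 90% of the time, which is why the interpretation as stated is too crude.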

**2. Subjective interpretation.** A probability describes the ‘degree of belief that a certain character is true’, i.e., the probability describes the degree of belief we have that the coin will land heads before we toss it.

My view: conceptually, regarding how we interpret probabilities with respect to future events, this is a useful interpretation, but this is not a ‘real world’ interpretation and it doesn’t offer any insight into how to estimate probabilities.

**3. Hypothetical relative frequency interpretation**. The definition of the probability, *p*, is,

Pr(|f − p| > ε) → 0 as the number of trials, n, goes to infinity, for all ε > 0,

where f is the proportion of successes for n trials. Sober says this definition is circular because a *probability* is defined in terms of a *probability* converging to 0.

My view: This is a helpful conceptual interpretation of what a probability is, but again it’s unworkable as a real world definition because it requires an infinite number of trials.
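Although we can never run the infinite number of trials, the convergence itself is easy to illustrate with finitely many simulated tosses: the relative frequency f drifts toward p as n grows. A minimal simulation (the choice p = 0.5 and the seed are just illustrative assumptions):

```python
import random

random.seed(1)

# Relative frequency of heads in n simulated tosses of a p-coin.
def relative_frequency(n_tosses, p=0.5):
    heads = sum(random.random() < p for _ in range(n_tosses))
    return heads / n_tosses

# |f - p| tends to shrink as the number of trials grows.
for n in (10, 1000, 100000):
    f = relative_frequency(n)
    print(f"n = {n:>6}: f = {f:.4f}, |f - 0.5| = {abs(f - 0.5):.4f}")
```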

**4. Propensity interpretation.** Characteristics of the object can be interpreted as translating into probabilities. For example, if the coin has equally balanced mass then it will land heads with probability 0.5. Sober says that this interpretation lacks generality and that ‘propensity’ is just a renaming of the concept of probability and so this isn’t a helpful advance.

My view: This is a helpful real world definition as long as we are able to produce a mechanistic description that can be recast in terms of the probability we are trying to estimate.

So far I don’t see too much wrong with 2-4 and I still think that I can estimate probabilities from data. Perhaps the issue is that Sober wants to understand what a probability is and I just want to estimate a probability from data; our goals are different.

I would go about my task of parameter estimation using maximum likelihood. The likelihood function will tell me the ~~how likely it is~~ likelihood that a parameter (which could be a probability) is equal to a particular value given the data. The likelihood isn’t a probability, but I can generate confidence intervals for my parameter estimates given the data, and similarly, I could generate estimates of the probabilities for different estimates of the parameter. In terms of Sober’s question, understanding what a probability is, I now have a probability of a probability, and so maybe I’m no further ahead (this is the circularity mentioned in 3.). However, for estimating my parameter this is not an issue: I have a parameter estimate (this is a probability) and a confidence interval (that was generated by a probability density).
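To make that concrete, here is a minimal sketch of the estimation procedure for the coin example (the data, 58 heads in 100 tosses, are made up for illustration): the Binomial likelihood, its maximizer k/n, a normal-approximation (Wald) confidence interval, and a numerical check that the likelihood integrated over the parameter is not 1, so it is indeed not a probability density over p.

```python
from math import comb, sqrt

# Binomial likelihood of the parameter p, given k heads in n tosses.
def likelihood(p, k, n):
    return comb(n, k) * p**k * (1 - p)**(n - k)

k, n = 58, 100              # hypothetical data: 58 heads in 100 tosses
p_hat = k / n               # the maximum-likelihood estimate is k/n
se = sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se   # ~95% Wald interval
print(f"MLE = {p_hat:.2f}, approx. 95% CI = ({lo:.3f}, {hi:.3f})")

# The likelihood is not a probability density in p: integrating it over
# [0, 1] gives 1/(n+1), not 1 (crude Riemann sum below).
m = 10000
integral = sum(likelihood(i / m, k, n) for i in range(m + 1)) / m
print(f"integral of L(p) dp = {integral:.5f}  vs  1/(n+1) = {1/(n+1):.5f}")
```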

Maybe… but I’m becoming less convinced that there really *is* a circularity in 3 in terms of understanding what a probability is. I think f(x)=f(x) is a circular definition, but f(f(x)) just requires applying the function twice. It’s a nested definition, not a circular definition. So which is this?

Word for word, this is Sober’s definition:

P(the coin lands heads | the coin is tossed) = 0.5 if, and only if, P(the frequency of heads = 0.5 ± ε | the coin is tossed n times) = 1 in the limit as n goes to infinity,

which he then says is circular because ‘the probability concept appears on both sides of the if-and-only-if’. It is the same *probability concept*, but strictly speaking the probabilities on either side refer to different events. So while that might not work for understanding the *concept* of probability, the definition is helpful for estimating probabilities from relative frequencies, if we can work around the issue of not being able to conduct an infinite number of trials. For me, that’s where the likelihood framework helps: given a finite number of trials, in most situations of interest we won’t be able to estimate the parameter with 100% certainty, and so we need to apply our understanding of what a probability is a second time to reach our understanding of our parameter estimate.

But is that really a circular definition?

I’m not an expert on this, I just thought it was interesting. Is anyone familiar with these arguments?

References

Sober, E. 2000. *Philosophy of Biology*, 2nd ed. Westview Press, USA.

I have to admit that I have trouble appreciating the relevance of these subtle differences in the definition of probability. As far as I can see, any probabilistic hypothesis (i.e., a stochastic model, and this is what we are interested in) will usually make a clearly defined prediction, in terms of probability (distributions), for its possible outcomes.

Where things get messy is when we come to the problem of inference (inverse probability): we want to make statements about the relative probability of alternative probabilistic hypotheses (differing in parameters or structure) based on observed outcomes, without knowing in advance how many alternative hypotheses there are or whether the “true” model is on our list.

This is the point where the three main modes of statistics (Frequentist, MLE, Bayes) have advocated different approaches. In that respect, I guess one could conclude that statistics “lacks a sound philosophical basis”, but I personally prefer to think of these three approaches as three indicators that are consistent by definition, but simply report different things. They only appear inconsistent when it is wrongly assumed that they report the same thing, namely the probability of a model to be “true” in an absolute sense, which is actually provided by none of them.

I’d be really glad to be able to point to a comprehensible and at the same time mathematically precise review paper on the mathematical/philosophical differences and the historical reasons for going from the Bayesian approach to MLE and the frequentist view, but I haven’t found one so far (any suggestions appreciated). As far as MLE is concerned, I still find the original reference by Fisher one of the best accounts of the reasons for going from Bayes to the MLE view; Fisher clearly rejects interpreting integrals of the likelihood as probabilities and thereby sets the cornerstones of the current MLE framework, with likelihood ratio tests etc. (see p. 326).

Fisher, R. A. (1922) On the mathematical foundations of theoretical statistics. Philos. T. Roy. Soc. A., 222, 309-368.

Good comments. Thanks.

I never really got the “circular argument” argument. I’m not a mathematical statistics expert, but the frequentist argument always seemed like a logical extension of the idea of limits from calculus: for a given process, something has a probability p iff, as you look at longer and longer sequences, the sequences whose fraction of events differs from p by more than ε make up a smaller and smaller fraction of the total, going to zero in the limit. One of the nice things is that it also allows us to say that certain mechanisms won’t have a probability associated with them; non-stationary systems, for instance, won’t have a well-defined probability of being in a given state after a given amount of time. It also allows us to talk about the probabilities of deterministic systems: for the logistic map in its chaotic regime, we can define a well-behaved probability of it being in any given state after a long enough time, without having to assume any sort of stochastic system. (I apologize to any mathematicians in the audience… this reply is incredibly imprecise.)
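The logistic-map point can be checked numerically. The sketch below (my own illustration, with an arbitrary starting point and bin choices) iterates x → 4x(1 − x) and measures the long-run fraction of time the orbit spends near the edge versus the middle of [0, 1]; for r = 4 the invariant density is U-shaped, so the deterministic orbit visits the edges far more often than the middle.

```python
# Logistic map x -> r*x*(1-x) in the chaotic regime (r = 4).
# Deterministic dynamics, yet the long-run fraction of time spent in a
# region of state space behaves like a well-defined probability.
def logistic_orbit(x0, r=4.0, n=100000, burn_in=1000):
    x = x0
    orbit = []
    for i in range(burn_in + n):
        x = r * x * (1 - x)
        x = min(max(x, 0.0), 1.0)   # guard against float overshoot at the peak
        if i >= burn_in:
            orbit.append(x)
    return orbit

orbit = logistic_orbit(0.2)
frac_edge = sum(x < 0.1 for x in orbit) / len(orbit)
frac_mid = sum(0.45 <= x < 0.55 for x in orbit) / len(orbit)
print(f"fraction of time in [0, 0.1):    {frac_edge:.3f}")
print(f"fraction of time in [0.45, 0.55): {frac_mid:.3f}")
```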

Eric, thanks for this & your other comment. You’re right that my discussion around the likelihood is not very clean.

Before I started this reading I thought that there was an issue with estimating probabilities from relative frequencies because ‘this is circular’. The only real issue I can see is that you’d need an infinite number of trials to estimate a parameter with 100% certainty. I had known that before I started, but somehow I was under the impression that there was another issue too. However, now that I’ve looked into it, I don’t see too much else amiss (which is what you had said in your comment too).

Also, to be a stats pedant: the likelihood function is not “how likely it is that a parameter (which could be a probability) is equal to a particular value given the data”. That would be a posterior. The likelihood is the probability (or probability density) of observing the data for a given set of parameters. It’s a function of the parameters, but that’s because the parameters are the only thing we have any control over picking. It’s asking: “how likely would it be to observe this data, if we assume the underlying parameters are like this?”

Thanks Eric. Hobbs & Hilborn (2006) say: ‘so, in contrast to the probability statement above, we are now interested in the likelihood (L) of competing hypotheses given the data, which is proportionate to the probability of the data given the hypothesis’. I think I needed to write “the likelihood that a parameter (which could be a probability) is equal to a particular value given the data”. I thought writing ‘how likely’ would invoke the notion of likelihood, but I think you’re right that it invokes the notion of probability instead. I’ll revise that.

Yeah, I see what you mean; going through some of my stats books, there are definitely those who insist you should talk about “the likelihood of a parameter”, since the likelihood function is a function of the parameters, not the data (the data are assumed constant). Interestingly, it seems that it’s the Bayesian authors who insist on saying “the likelihood of the parameters”, whereas the frequentists generally just talk about the likelihood function and don’t say what it’s the likelihood of. I still wince at saying “the likelihood of the parameters” since, under any reasonable interpretation of “likely” or “likelihood” in general usage, it means “probability”, which can easily lead people into thinking they’re talking about posteriors. It’s one of those cases where using technical language can cause more confusion than not.