In fitting mathematical models to empirical data, one challenge is that deterministic models make exact predictions, whereas empirical observations rarely match those predictions perfectly. Changing a parameter value may reduce the distance between the model predictions and some of the data points, but increase the distance to others. To estimate the best-fit model parameters, it is necessary to assign a probability to deviations of a given magnitude. The function that assigns these probabilities is called the error distribution. In this post, I ask:
Do mechanistic, deterministic mathematical models necessarily have to have error distributions that are mechanistically derived?
One of the simplest approaches to model fitting is to choose a probability distribution, such as the normal or Poisson distribution, as the error distribution and to use a numerical maximization algorithm to identify the best-fit parameters, that is, the parameter values under which the observed data are most probable. The more parameters there are to estimate, the more time-consuming this numerical search becomes, but in most cases this approach to parameter estimation is successful.
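Here is a minimal sketch of that simple approach, using only the Python standard library. Everything specific in it is my own assumption for illustration: the exponential growth model, the known starting size `x0`, the fixed error standard deviation `sigma`, the made-up data, and the golden-section search standing in for whatever numerical optimizer you prefer.

```python
import math

def model(t, r, x0=10.0):
    """Deterministic prediction: exponential growth from an assumed known x0."""
    return x0 * math.exp(r * t)

def neg_log_likelihood(r, times, observed, sigma=1.0):
    """Negative log-likelihood under independent normal errors (sigma assumed known)."""
    nll = 0.0
    for t, y in zip(times, observed):
        resid = y - model(t, r)
        nll += 0.5 * math.log(2 * math.pi * sigma**2) + resid**2 / (2 * sigma**2)
    return nll

def golden_section_minimize(func, lo, hi, tol=1e-8):
    """Minimize a function of one variable that is unimodal on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if func(c) < func(d):
            hi = d  # the minimum lies in [lo, d]
        else:
            lo = c  # the minimum lies in [c, hi]
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2

# Hypothetical yearly counts, invented for this example.
times = [0, 1, 2, 3, 4, 5]
observed = [10.3, 12.0, 15.2, 18.1, 22.6, 27.0]

# Maximizing the likelihood is the same as minimizing its negative logarithm.
r_hat = golden_section_minimize(
    lambda r: neg_log_likelihood(r, times, observed), 0.0, 1.0
)
print(f"best-fit growth rate: {r_hat:.3f}")
```

With one parameter, a line search like this is enough. With several parameters you would swap in a multidimensional optimizer, which is where the search starts to get slow.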
In biology, the processes that give rise to deviations between model predictions and data are measurement error, process error, or both. Some simple definitions are:
- Measurement error: y(t) = f(x(t), b) + e
- Process (or demographic) error: y(t) = f(x(t)+e, b)
where x(t) is a variable, such as the population size at time t, b is a vector of parameters, f(x,b) is the solution to the deterministic model, e is the error as generated by a specified probability density function, and y(t) is the model prediction including the error. As examples, counting the number of blue ducks each year might be subject to measurement error if a major source of error is in correctly identifying the colour of the duck, whereas extreme weather events that affect duckling survivorship are a source of process error.
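To make the two definitions concrete, here is a small simulation sketch. It assumes a discrete-time model in which a hypothetical function `step` advances the population by one year with geometric growth, and it assumes normally distributed errors. The starting size, growth factor, and error standard deviation are all invented for illustration.

```python
import random
import statistics

def step(x, b):
    """One year of deterministic geometric growth; b is the growth factor."""
    return b * x

def final_count_measurement(x0, b, years, sd, rng):
    """Measurement error: y(t) = f(x(t), b) + e."""
    x = x0
    for _ in range(years):
        x = step(x, b)  # the true population follows the model exactly
    return x + rng.gauss(0.0, sd)  # the error enters only when we count

def final_count_process(x0, b, years, sd, rng):
    """Process error: y(t) = f(x(t) + e, b), applied every year."""
    x = x0
    for _ in range(years):
        x = step(x + rng.gauss(0.0, sd), b)  # the perturbed state is fed back in
    return x

rng = random.Random(42)  # seeded so the run is repeatable
x0, b, years, sd, reps = 100.0, 1.1, 10, 2.0, 2000

measurement = [final_count_measurement(x0, b, years, sd, rng) for _ in range(reps)]
process = [final_count_process(x0, b, years, sd, rng) for _ in range(reps)]

sd_measurement = statistics.stdev(measurement)
sd_process = statistics.stdev(process)
print(f"spread of year-{years} counts, measurement error: {sd_measurement:.2f}")
print(f"spread of year-{years} counts, process error:     {sd_process:.2f}")
```

Under measurement error, the spread of the final counts stays close to the spread of a single counting error. Under process error, each year's perturbation is carried forward by the model, so the spread across replicates is considerably larger.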
In the simple approach described above, I had in mind the measurement error formulation of the full model. Under this formulation, many of the probability distributions that might be chosen as the error distribution have a process-based interpretation. For example, the normal distribution arises if (1) there are many different sources of measurement error, (2) these errors are independent and arise from the same distribution, and (3) the total measurement error is the sum of all the errors. In biological data, all of that might be true to some degree, but in general this explanation is likely incomplete.
A second justification of the simple approach could be that the error distribution is not intended to be mechanistic. Here, the normal distribution is simply a function with the necessary characteristic: its density is a decreasing function of the absolute value of the deviation. But if you have derived a mechanistic deterministic model, is it really okay to have an error distribution that isn’t justified on mechanistic grounds? Does such an error distribution undermine the mechanistic model formulation to the point where you might as well have started with a more heuristic formulation of the whole model? Would this be called semi-mechanistic, if the model is mechanistic but the error distribution is heuristic?
If this all seems like no big deal, consider that measurement error does not compound under the effect of the deterministic model, while process error does. When only measurement error operates, the processes occur as hypothesized and only the measurements are off. When process error occurs (say, slightly higher duck mortality than average), there are fewer breeding ducks in the next year, and this change feeds back into the process, affecting the predictions made for future years. This makes fitting the model to y(t) quite difficult, because model fitting is easier when the model and the error can be separated so that numerical methods for solving deterministic models can be used. If the error and the model cannot be disentangled, then fitting to y(t) will usually involve solving a stochastic model of some sort, which is more difficult and more time-consuming.
An easier alternative for the process error formulation is to fit using gradient matching. This works because deterministic models are usually differential equations, x'(t) = g(x(t), b). Let z(t) be a transformation of the data such that z(t) = [y(t+Δt) - y(t)]/Δt; then we can fit the model as z(t) = g(x(t), b) + e1, where e1 are the deviations between the empirical estimate of the gradient and the gradient predicted by the model. Deviations from the model-predicted gradient can be viewed as errors in the model formulation or as error that arises from variation in the processes described by the model. If we have a mixture of measurement error and process error, then we could do something nice like generalized profiling.
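As a rough sketch of gradient matching, suppose the model is exponential growth, so that g(x, b) = b·x, and suppose e1 is normally distributed, so that the best fit is the least-squares fit. Both are assumptions I am making for illustration, as are the synthetic data and the choice to use the observed y(t) in place of x(t) on the right-hand side. With this g, the least-squares estimate of b has a closed form and no differential equation needs to be solved.

```python
import math
import random

def gradient_match_growth_rate(times, y):
    """Least-squares estimate of b in z(t) = b * y(t) + e1."""
    num = 0.0
    den = 0.0
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        z = (y[i + 1] - y[i]) / dt  # empirical gradient by forward difference
        num += z * y[i]
        den += y[i] ** 2
    return num / den  # minimizes the sum of (z - b * y)^2 over b

# Synthetic data: exponential growth at rate 0.2 with a little seeded noise.
rng = random.Random(1)
times = [0.5 * i for i in range(21)]
y = [10.0 * math.exp(0.2 * t) + rng.gauss(0.0, 0.05) for t in times]

b_hat = gradient_match_growth_rate(times, y)
print(f"gradient-matching estimate of b: {b_hat:.3f}")
```

The estimate is only as good as the finite-difference approximation of the gradient, so it depends on Δt being small relative to how quickly the population changes, and noisy data make the empirical gradients noisier still.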
Anyway, this has all been my long-winded response to a couple of great posts about error at Theoretical Ecology by Florian Hartig. I wanted to revisit Florian’s question ‘what is error?’ Is error stochasticity? That would mean that e is a random variable, and I have a hard time imagining any good reason why e would not be a random variable. However, I think there are more issues to resolve if we want to understand how to define error. Specifically, how do we decide which processes are subsumed under f(x(t),b) and which go under e? Is this a judgment call, or should all the deterministic processes be part of f(x(t),b) and all the stochastic processes be put into e and therefore be considered error?