The prediction error minimization theory (PEM) says that the brain continually seeks to minimize its prediction error – minimize the difference between its predictions about the sensory input and the actual sensory input. It is an extremely simple idea but from it arises a surprisingly resourceful conception of brain processing. In this post, I’ll try to motivate the somewhat ambitious idea that PEM explains everything about the mind.
The first objection people have when learning about the idea of prediction error minimization is that it must obviously be false. Minimizing prediction error is minimizing surprise, and the best way to minimize surprise, when it comes to sensory input, is to not have any sensory input. If we minimize prediction error we should therefore all seek out dark rooms and stay there. But we obviously don’t, so PEM is false.
This objection rests on a misunderstanding about what the theory says. It is crucial to see that it concerns prediction error minimization on average and in the long run. The brain is doing lots of things to maintain its ability to minimize prediction error reasonably well at the current time and over time. This means that just seeking out the dark room and staying there is not going to work since after a while in the dark room prediction error is going to increase. Hunger, thirst and loneliness are all states we don’t expect in the long run, so they are surprising. Similarly, concerned family members, council workers and landlords are going to come knocking, creating prediction error that staying in the room cannot deal with. (See here for a great paper on the dark room).
What the dark room problem tells us is that prediction error minimization always happens given a model, a set of expectations. We will find an organism chronically in the dark room only if this is the kind of creature that on average is expected to be found in a dark room.
There is much more to say about the idea that prediction error minimization is always relative to a model (see my book and Andy Clark’s terrific BBS paper for introductions). At this point it would in many ways make sense to shift to talking about the free energy principle and its relation to self-organized systems.
In this post, however, I will use the idea that prediction error minimization happens over time to motivate the range of activities the brain engages in to safeguard its ability to minimize error.
Assume the brain harbours models of the environmental causes of its sensory input. On the basis of these models, it generates predictions about what the next sensory input will be. These predictions occur concurrently on several time scales, ordered hierarchically up through the cortex. In this way the sensory input is predicted under expectations of what might happen very soon and also under expectations about how what may happen very soon is influenced by what happens at slower time scales (I might expect sensory input as of a leaf dropping to the ground, and that expectation is modulated by my expectations about the windy conditions, and by my expectations about the lighting at this time of day).
Perception happens, then, as the model generates expectations that anticipate the actual sensory input. The parameters of the model’s predictions are updated in the light of any prediction error, in approximate Bayesian inference: the prediction error is weighted by how much we already know (the prior precision) and by how much we are learning from the input (the likelihood precision). What we perceive is then determined by the currently best-performing predictions.
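To make the precision-weighting concrete, here is a minimal toy sketch (a standard Gaussian example with invented numbers, not anything the brain literally computes) of how a prediction error, weighted by prior and likelihood precision, updates a belief:

```python
# Minimal toy example of precision-weighted prediction error minimization.
# Gaussian beliefs over a single hidden cause; all numbers are invented.

def update_belief(prior_mean, prior_precision, sensory_input, likelihood_precision):
    """One step of approximate Bayesian inference."""
    prediction_error = sensory_input - prior_mean
    # How much the error counts depends on how reliable the input is
    # relative to what we already expect.
    learning_rate = likelihood_precision / (prior_precision + likelihood_precision)
    posterior_mean = prior_mean + learning_rate * prediction_error
    posterior_precision = prior_precision + likelihood_precision
    return posterior_mean, posterior_precision

# Confident prior, noisy input: the belief barely moves.
print(update_belief(prior_mean=10.0, prior_precision=4.0,
                    sensory_input=14.0, likelihood_precision=0.5))
# Vague prior, precise input: the belief moves most of the way to the input.
print(update_belief(prior_mean=10.0, prior_precision=0.5,
                    sensory_input=14.0, likelihood_precision=4.0))
```

The point is simply that the very same prediction error moves the belief a lot or a little depending on the relative precisions.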
It is clear that if predictions are not informed by longer-term expectations, then it will be hard to predict sensory input very well (the trajectory of the leaf will be hard to anticipate, and the precision of the sensory input at dusk may be confounding). This follows from the simple observation that we live in a world where causes interact (cf. the famous cat behind the fence). This kind of ‘convolving’ of expectations based on causal regularities at different time scales ensures that perceptual inference, via prediction error minimization, can capture the full, integrated richness of perception. This is made possible by incorporating a long term perspective on prediction error minimization. Here we immediately get a learning (and memory) perspective because building up a hierarchical model then requires extracting causal regularities over various time scales and using them to predict better.
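As an illustration of the time-scale point, here is a toy sketch (again with invented dynamics, far simpler than any serious hierarchical model) in which a fast level tracks the leaf from moment to moment while a slower level gradually extracts the prevailing wind from the errors that remain:

```python
# Toy sketch of prediction error minimization at two time scales. The fast level
# tracks the leaf's position from moment to moment; the slow level gradually learns
# the prevailing wind from the errors that remain. Everything here is invented.

def simulate(true_wind=0.8, steps=50, fast_rate=0.6, slow_rate=0.05):
    leaf = 0.0          # the actual leaf position in the world
    leaf_belief = 0.0   # fast-time-scale expectation
    wind_belief = 0.0   # slow-time-scale expectation that modulates the fast one
    for _ in range(steps):
        leaf += true_wind                        # the world: the leaf drifts with the wind
        prediction = leaf_belief + wind_belief   # fast prediction, shaped by the slow level
        error = leaf - prediction
        leaf_belief = prediction + fast_rate * error   # fast update, every moment
        wind_belief += slow_rate * error               # slow update, accumulating over time
    return wind_belief

print(simulate())   # ends up close to the true wind strength of 0.8
```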
The idea so far is that there is a rich, top-down cascade of interwoven predictions, which seeks to dampen down the sensory input. This can be conceived as driving top-down messages. But we also need a specific type of modulating top-down messages. This is because levels of uncertainty change in the world and thereby the trustworthiness or precision of the bottom-up prediction error changes in a context or state-dependent manner. (To take a long-term example, even if I have learnt to expect certain kinds of rainfall in certain seasons I might have to adjust this in the light of a shift from La Niña to El Niño, such that a bit of rain can’t be trusted as a sign of a good season). This is important because if model parameters are to be updated optimally in the light of prediction error then there needs to be an estimation of the precision of that prediction error (it is no good to change the hypothesis in the light of an imprecise measurement). This calls for building up expectations of the precisions of prediction errors, that is, expectations about the contexts in which prediction errors tend to be trustworthy. This is just more prediction error minimization, but of a higher statistical order (for example, we can be surprised at the precision of prediction error). Rather interestingly, this gives us attention. This is because attention is allocation of resources to worthwhile signals, and expectations of precisions can guide prediction error efforts to worthwhile or precise signals. There is of course much more to say about this idea but it is enormously appealing because it brings attention in at the ground level and as separate from, yet intricately related to, perception.
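Here is a hedged toy sketch of that idea: the trustworthiness of prediction errors in a context is estimated from the spread of past errors there, and that expected precision sets the gain on new errors (the contexts and numbers are made up for illustration):

```python
import statistics

# Toy sketch of attention as expected precision. How trustworthy prediction errors
# have been in a context (estimated from the spread of past errors) sets the gain
# on new errors in that context. Contexts and numbers are invented.
past_errors = {"daylight": [0.1, -0.2, 0.15, -0.05], "dusk": [2.0, -3.1, 2.5, -1.8]}

def expected_precision(context):
    return 1.0 / statistics.variance(past_errors[context])

def attention_weighted_update(belief, sensory_input, context, prior_precision=1.0):
    error = sensory_input - belief
    gain = expected_precision(context) / (expected_precision(context) + prior_precision)
    return belief + gain * error

print(attention_weighted_update(10.0, 14.0, "daylight"))  # error trusted: large update
print(attention_weighted_update(10.0, 14.0, "dusk"))      # error distrusted: small update
```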
This gives us perception, learning and attention directly from PEM. Next, think about what I will (somewhat artificially) call understanding. I conceive of understanding as having a reasonable model for making sense of a domain, even if there is still uncertainty about the states of the domain. The opposite of understanding is confusion, which is not knowing which model can reasonably be appealed to. Whereas perception seeks to minimize uncertainty about what causes the sensory input, understanding is concerned with selecting a reasonable model with which to minimize uncertainty. For example, if you see a die showing some number of pips, confusion will ensue if you think of that input under a coin-tossing model. It is clear that prediction error minimization is helped by selecting good models since the wrong model will be no good at anticipating the next input. A good model also captures the given input in a minimally complex fashion, without too many unnecessary parameters. If the model is overfitted then it will be poor at predicting what the next series of inputs is going to be. Overfitting may give decent momentary or short-term prediction error minimization but is bound to fail in the long run. Hence we should expect the prediction error minimizing brain to be engaged in model selection and complexity reduction, in other words, that it will aim at understanding.
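A familiar statistical toy example (assuming numpy, and not anything specific to PEM) illustrates why short-term error minimization is not enough: a model with too many parameters fits the inputs it has already seen almost perfectly, yet predicts the next inputs badly:

```python
import numpy as np

# Toy sketch: an overfitted model nails the inputs it has already seen but is poor
# at predicting the next inputs. Data and models are invented for illustration.
rng = np.random.default_rng(0)
x_seen, x_next = np.linspace(0, 1, 8), np.linspace(1.1, 1.5, 5)

def true_cause(x):
    return 2.0 * x + 0.5

y_seen = true_cause(x_seen) + rng.normal(0, 0.1, x_seen.size)
y_next = true_cause(x_next) + rng.normal(0, 0.1, x_next.size)

for degree in (1, 7):   # a simple model vs. one with as many parameters as data points
    coeffs = np.polyfit(x_seen, y_seen, degree)
    err_seen = np.mean((np.polyval(coeffs, x_seen) - y_seen) ** 2)
    err_next = np.mean((np.polyval(coeffs, x_next) - y_next) ** 2)
    print(f"degree {degree}: error on seen inputs {err_seen:.3f}, on the next inputs {err_next:.3f}")
```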
The last thing to add is action. The whole story here is a little more involved even though the basic idea is utterly simple. If a hypothesis has predictions that don’t hold up, then the hypothesis can be changed to fit the input or the input can be changed to fit the hypothesis. So far the discussion has been about the first direction of fit but of course it is possible to minimize prediction error under the other direction of fit too, and this is action. This is central to PEM (and more generally to the free energy principle). This was implicit in the discussion of the dark room problem above: we need to act in the world to minimize prediction error in the long run. If we only ever update our model parameters in the light of the error, then we will not be able to maintain ourselves in low-surprise states (given our model). More concretely, we have to conceive of action in terms of a competition of hypotheses about what the true state of the world is. For example, one (actually true) hypothesis is that my hand is close to the keyboard and another (actually false) hypothesis is that my hand is on the cup of coffee next to me. Action ensues when the actually false hypothesis begins to win, which it does when I increasingly mistrust the actual sensory input: the false hypothesis is then made true by minimizing its prediction error as my hand reaches out for the cup. It sounds rather intricate but is a compelling idea, which does away with cost functions and motor commands. Currently, we are having a lot of fun with this notion of ‘active inference’ both in self-tickle experiments with George van Doorn and in reaching tasks in studies of autism spectrum disorder with Colin Palmer, Peter Enticott and Bryan Paton.
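Here is a deliberately crude sketch of the two directions of fit (invented dynamics, nothing like real motor control): the same prediction error can be reduced by revising the belief or by changing the world, and which one dominates depends on how much the current sensory input is trusted:

```python
# Crude sketch of the two directions of fit. The same prediction error can be
# reduced by revising the belief (perceptual inference) or by changing the world
# (action); which dominates depends on how much the sensory input is trusted.
# The dynamics and numbers are invented for illustration.

def minimize_error(belief, world, trust_in_input, steps=20, rate=0.5):
    for _ in range(steps):
        error = world - belief                         # sensed state minus prediction
        belief += rate * trust_in_input * error        # trust the input: revise the belief
        world -= rate * (1 - trust_in_input) * error   # distrust it: act on the world
    return round(belief, 2), round(world, 2)

# Belief: "my hand is on the cup" (1.0); actual state: hand near the keyboard (0.0).
print(minimize_error(belief=1.0, world=0.0, trust_in_input=0.9))  # the belief moves to the world
print(minimize_error(belief=1.0, world=0.0, trust_in_input=0.1))  # the hand moves to the cup
```

The real story replaces this single trust knob with hierarchies of precision expectations, but the direction-of-fit point survives the simplification.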
So now, from the sparse beginnings of PEM, we get integrated conceptions of perception, learning, attention, understanding and action. Moreover, we get this at multiple, interwoven timescales stretching from the lowest sensory attributes to the most frontal, long-term representations in the brain. Every one of these aspects of PEM is being investigated in labs around the world.
The PEM mechanism takes care of the problem of perception since high mutual information between brain and world is ensured by comparing two quantities that the brain actually has access to, namely predictions and sensory input. It also relates nicely to ideas about what it takes to be a living organism.
Viewed as this kind of package, PEM has the promise to account for everything mental. What more could you want than perception, learning, attention, understanding and action? This promise is strengthened when these aspects of PEM are applied to different areas, such as interoception (yielding emotion) and the self (viewed as a parameter that helps explain the evolution of the sensory input), etc. (for more, see my book, Andy Clark’s recent series of papers, work by Anil Seth, and work from Karl Friston’s group).
For these reasons, and also for a few other reasons, I think prediction error minimization is all the brain ever does.
Of course, there are very many questions to ask about PEM. Like: what’s the evidence for it? Is it a good thing if something like PEM unifies work on the mind? In what sense does PEM explain, and how is PEM implemented in the brain? And so on. This post is, however, conveniently long enough already. In the next post, which will hopefully be shorter, the plan is to talk about how PEM might relate to embodied cognition, and then in the last post say a little about consciousness and PEM.