
Key insights:
Backpropagation powers virtually all of modern machine learning. It works brilliantly in silicon. But the brain almost certainly uses a different approach. The reasons come down to two fundamental biological constraints that backpropagation simply cannot satisfy.
When you have a system with millions of adjustable parameters, like connection weights between neurons, you need to figure out which ones to change and by how much. This is the credit assignment problem.
Artificial neural networks solve this elegantly through calculus. They apply the chain rule to calculate precisely how each parameter should be nudged to improve performance. This process is called automatic differentiation, and its reverse mode forms the backbone of backpropagation.
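To make the chain rule concrete, here is a minimal sketch on a one-dimensional toy function (the function and names are illustrative, not from the text): the derivative of a composition is the product of the derivatives of its parts, and we can sanity-check the analytic result against a numerical estimate.

```python
import math

# Composite function f(x) = sin(x^2): an outer sin applied to an inner square.
def f(x):
    return math.sin(x ** 2)

# Chain rule: df/dx = cos(x^2) * d(x^2)/dx = cos(x^2) * 2x.
def f_grad(x):
    return math.cos(x ** 2) * 2 * x

x = 1.3
h = 1e-6
# Central finite difference as an independent check on the analytic gradient.
numeric = (f(x + h) - f(x - h)) / (2 * h)
print(abs(f_grad(x) - numeric))  # tiny: the two estimates agree
```

Backpropagation applies this same product-of-derivatives logic, systematically, through every layer of a network.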
The brain faces the same challenge. Every time you learn something new, your synapses need to adjust. But the brain doesn't have access to the same mathematical machinery that computers use. It needs a different solution.
Backpropagation requires strictly separated phases. First, information flows forward through the network. Then an error is calculated at the output. Then that error travels backward, layer by layer, to update weights.
For this to work, neurons must freeze their activity values while error signals propagate backward. The brain doesn't do this. Communication in biological tissue is slow compared to silicon processors. If the brain followed backpropagation's approach, it would need to stop processing information for hundreds of milliseconds during each learning step.
Imagine experiencing brief blackouts every time you learned something new. That doesn't happen. Biological brains process information and learn simultaneously in a continuous stream. There is no evidence for separate forward and backward phases.
Backpropagation requires a central controller to switch the entire network between forward and backward modes. Errors must propagate in a precise temporal sequence. You cannot compute errors for a given neuron before its downstream partners have finished their own calculations.
Everything we know about brain physiology suggests this kind of global coordination is extremely unlikely. While the brain has some coordinating mechanisms like theta and gamma oscillations and neuromodulators like dopamine, these operate at much coarser scales than backpropagation requires.
Individual neurons and synapses mostly function as autonomous agents. They modify their states based solely on information physically available at their specific locations. The brain is a massively parallel, locally autonomous system. As explored in detail by Lillicrap et al. (2020), these constraints make direct implementation of backpropagation in neural tissue virtually impossible.
Predictive coding offers an alternative that respects the brain's biological constraints. It originated from mid-20th century research proposing that the brain's fundamental objective is to predict incoming sensory information. Let's build it up step by step.
From an evolutionary perspective, prediction helps survival. An organism that can anticipate threats and interpret noisy observations has a clear advantage. There's also an efficiency argument. Neural activity demands considerable metabolic energy. A brain that predicts incoming signals only needs to process unexpected information.
In this view, the brain's primary task isn't simply processing stimuli. It's constructing an internal model that explains sensory inputs. When predictions are accurate, minimal processing is required. When predictions fail, the resulting prediction errors signal that the internal model needs updating.
Predictive coding formalizes this as a hierarchical system. Each neural layer attempts to predict the activity of the layer below it. The lowest level corresponds to raw sensory input. Higher levels encode increasingly abstract features. Top-down connections carry predictions. Bottom-up connections carry prediction errors. This framework was originally proposed by Rao and Ballard (1999) in their landmark paper on predictive coding in the visual cortex.
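The hierarchy can be sketched in a few lines (layer sizes and weight scales here are arbitrary choices for illustration): top-down weights map each level's activity to a prediction of the level below, and the residuals are the prediction errors carried upward.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three levels: level 0 is raw sensory input, levels 1-2 are more abstract.
sizes = [4, 3, 2]
# W[l] maps level l+1 down to a prediction of level l (top-down connections).
W = [rng.normal(size=(sizes[l], sizes[l + 1])) * 0.5 for l in range(2)]

x = [rng.normal(size=s) for s in sizes]    # current activity at each level
x[0] = np.array([1.0, 0.0, -1.0, 0.5])     # level 0 clamped to sensory input

# Top-down predictions and the bottom-up prediction errors they induce.
predictions = [W[l] @ x[l + 1] for l in range(2)]
errors = [x[l] - predictions[l] for l in range(2)]  # carried upward
```

Every quantity here is local: a level's error depends only on its own activity and the prediction arriving from directly above.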
Predictive coding networks are energy-based models. Each possible network state is assigned a single number, an abstract "energy". The system then evolves to reduce this energy, just like a ball rolling downhill.
The energy relates to the total magnitude of prediction errors across the network. Think of each neuron as a node on a post. Its height represents its activity level. On the same post sits a platform representing its predicted activity, determined by neurons from the layer above. A spring connects the node and the platform.
When a neuron's activity deviates from its predicted value, the spring stretches and energy increases. The total energy sums the squared errors across all neurons in every layer. The network's objective is to minimize this total prediction error by finding the optimal configuration of neural activities and connection weights.
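Under the linear-prediction assumption used in the sketch below (an illustrative simplification), the total energy is just the sum of squared spring stretches, and it is exactly zero when every layer matches its top-down prediction:

```python
import numpy as np

def total_energy(x, W):
    # E = sum over layers of || x[l] - W[l] @ x[l+1] ||^2
    # i.e. the summed squared "spring" stretches across the network.
    return sum(float(np.sum((x[l] - W[l] @ x[l + 1]) ** 2))
               for l in range(len(W)))

rng = np.random.default_rng(2)
W = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
x = [None, None, rng.normal(size=2)]
x[1] = W[1] @ x[2]   # layer 1 exactly matches its top-down prediction
x[0] = W[0] @ x[1]   # layer 0 likewise
print(total_energy(x, W))  # 0.0: no spring is stretched
```

Any deviation of an activity from its prediction stretches a spring and raises the energy above zero.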
Each neuron adjusts its activity by moving in the direction that most steeply reduces total energy. This is gradient descent, but applied locally. When you work out the math, each neuron's activity update depends on just two things.
First, its own prediction error drives it to align with its top-down prediction. Second, the prediction errors from the layer below encourage it to better predict downstream activity. These two forces compete until the neuron finds a compromise, an optimal activity level that minimizes prediction errors both at its own layer and the layer it helps predict.
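The two competing forces can be sketched as a relaxation loop (again with illustrative sizes, a linear prediction, and a hand-picked step size): the middle layer's update uses only its own error and the error from the layer below, yet repeatedly applying it drives the total energy down.

```python
import numpy as np

rng = np.random.default_rng(3)
W = [rng.normal(size=(4, 3)) * 0.5, rng.normal(size=(3, 2)) * 0.5]
x = [rng.normal(size=4), rng.normal(size=3), rng.normal(size=2)]
x[0] = np.array([1.0, -0.5, 0.3, 0.2])   # clamped sensory input

def energy(x, W):
    return sum(float(np.sum((x[l] - W[l] @ x[l + 1]) ** 2))
               for l in range(len(W)))

e_before = energy(x, W)
lr = 0.05
for _ in range(100):
    eps1 = x[1] - W[1] @ x[2]   # force 1: this layer's own prediction error
    eps0 = x[0] - W[0] @ x[1]   # force 2: the error at the layer below
    # Local gradient descent on the total energy: move toward the top-down
    # prediction (-eps1) while better predicting the layer below (+W.T @ eps0).
    x[1] += lr * (W[0].T @ eps0 - eps1)
e_after = energy(x, W)
print(e_before > e_after)  # relaxation reduced the total prediction error
```

Nothing in the update references distant layers or a global schedule; each step uses only locally available signals, which is the point of contrast with backpropagation.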
The key insight is that this requires a separate population of error neurons that explicitly encode prediction errors. This is the origin of the term