Chapter 1 Only Got Us Halfway There
In Chapter 1, we proved something uncomfortable. Strip away everything, and a single Transformer layer is just Forward Euler integration with step size permanently fixed at \(h = 1\):

\[
x_{k+1} = x_k + h\,f_k(x_k), \qquad h = 1
\]
We showed why this is a curse. No adaptive step size. No damping. The full force of everything the attention mechanism and the FFN learned gets injected into the hidden state at every layer. That's why early ambiguity amplifies. That's why "Mercury" becomes a planet by layer 80 whether you wanted it to or not.
But here's what I didn't tell you in Chapter 1.
That was about the method being broken. Euler is bad. Fine. But there's a deeper problem. Even if you swapped Euler for a perfect integrator, you would still have a catastrophically broken system. Because the thing being integrated has no business being a memory system at all.
The dynamical system itself violates every mathematical condition that any stable memory system must satisfy. Not some of them. Not most of them. Every single one.
That's what this chapter is about.
This is the chapter I find most interesting to write. Because once you see these five failures, you stop being surprised by anything the Transformer does wrong. The failure modes aren't bugs. They're consequences. Every single one is mathematically inevitable.
What Does a Stable Memory System Even Look Like?
Before we can say the Transformer fails, we need to know what passing looks like. What are the actual mathematical requirements of a system that evolves memory in a sound way?
Let's be precise. The Transformer's hidden state is a vector \(x_k \in \mathbb{R}^d\). Each layer applies a transformation to it:

\[
x_{k+1} = T_k(x_k) = x_k + f_k(x_k)
\]
This is a nonlinear discrete-time dynamical system. The state \(x_k\) is the "memory" at layer \(k\). The function \(T_k\) is the layer's transition map. And \(f_k\) is everything that happens inside the layer: the attention mechanism, the LayerNorm, the FFN, all of it combined into one big learned nonlinear function.
Now ask the right question. What do we actually need from this system? We need the final hidden state \(x_L\) to faithfully represent the semantic content of the input. Not an amplified version of the first thing the model found interesting. Not a state that's been inflated by 96 layers of energy injection. A faithful, stable representation.
For that, the system needs five things. Think of them as five locks that a real memory system must pass through. The Transformer fails every one.
Contraction Mappings: The System Has No Self-Correction
What Contraction Means
Imagine you have two slightly different interpretations of the same ambiguous sentence. Two hidden states, \(x\) and \(y\), starting close together but not identical. In a well-behaved memory system, those two states should get closer with each processing step, not further apart. Ambiguity should resolve, not compound.
The mathematical name for this property is contraction. A mapping \(T\) is a contraction if there exists some constant \(\alpha \in [0, 1)\) such that for every pair of states:

\[
\|T(x) - T(y)\| \;\leq\; \alpha\,\|x - y\|
\]
The constant \(\alpha\) is called the Lipschitz constant. When it is strictly less than 1, every application of the map squeezes the space a little. Two nearby states become even nearer. Apply the map again and they are nearer still. Keep going and they converge to the same point.
The Banach Fixed Point Theorem tells us the consequence. Any strict contraction on a complete metric space has exactly one fixed point, and starting from anywhere, repeated application converges there. It doesn't matter if you start with a noisy input, a wrong initial bias, or a confused early representation. The contraction wipes it out. You converge to the right answer regardless.
That is what disciplined memory looks like.
Picture a funnel. You drop two marbles in from different positions anywhere inside it. It doesn't matter where they started. Gravity and the funnel's shape force them toward the same exit point. The funnel is a contraction. The two marbles are your two interpretations. The exit is the correct answer. A stable memory system is a funnel. The Transformer is not a funnel.
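Here is a minimal numerical sketch of that funnel. Nothing below is a Transformer or anything trained; it is just an arbitrary affine map whose spectral norm I have scaled below 1, so it is a contraction by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.norm(W, 2)      # scale so the spectral norm (Lipschitz constant) is 0.9
b = rng.standard_normal(d)

def T(x):
    # Affine contraction: ||T(x) - T(y)|| = ||W (x - y)|| <= 0.9 ||x - y||
    return W @ x + b

x = 10.0 * rng.standard_normal(d)    # two very different starting "interpretations"
y = 10.0 * rng.standard_normal(d)

for _ in range(200):
    x, y = T(x), T(y)

print(np.linalg.norm(x - y))         # effectively zero: both marbles exit at the same point
```

Where the two marbles start is irrelevant; the Lipschitz constant of 0.9 guarantees the gap shrinks by at least 10 percent every step.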
Why the Transformer Cannot Possibly Be a Contraction
Apply the Transformer map \(T(x) = x + f(x)\) to two states \(x\) and \(y\). Compute the difference between their outputs:

\[
T(x) - T(y) \;=\; (x - y) + \bigl(f(x) - f(y)\bigr)
\]
The residual connection carries the original gap through unchanged and adds whatever the learned part contributes on top. For this to be a contraction, we would need the learned correction \((f(x) - f(y))\) to be large, negative, and precisely anti-parallel to \((x - y)\). The attention mechanism and the FFN would need to automatically detect the gap between any two possible states and work to close it.
Nothing in the architecture does that. Nothing in the loss function trains for it. The attention mechanism is designed to find relevant context and pull it in. The FFN is designed to amplify semantic signals it finds important. Neither of them cares about the gap between two hypothetical states. The architecture has no concept of "close this gap." It only knows "add this information."
A Contractive System
Two nearby interpretations evolve toward each other. Initial noise and ambiguity get squeezed out across steps. The system self-corrects. Wrong inputs get pulled back toward right ones.
The Transformer
\(T(x) = x + f(x)\) adds differences rather than damping them. Two interpretations that start slightly different can grow more different with every layer. Initial bias compounds. There is no self-correction built in anywhere.
This is exactly why the first few tokens of a prompt are so disproportionately powerful. They set a direction. That direction gets added to the residual stream at layer 1. No contraction pulls it back at layer 2. So it's still there at layer 3, slightly amplified. Still there at layer 10. By layer 80, the early context has been reinforced so many times it is effectively immovable. The Transformer has no mechanism to correct early bias. It only has mechanisms to propagate it.
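To see that lack of self-correction numerically, here is a toy residual stack. The \(f\) below is just a random linear map standing in for "attention plus FFN"; nothing is trained. The point is only that \(T(x) = x + f(x)\) has no reason to shrink the gap between two nearby states, and generically it grows it.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
W = rng.standard_normal((d, d)) / np.sqrt(d)    # toy stand-in for one layer's learned block

def layer(x):
    return x + W @ x                  # residual update T(x) = x + f(x), with f linear here

x = rng.standard_normal(d)
y = x + 1e-6 * rng.standard_normal(d)           # two almost identical interpretations

for k in range(1, 81):
    x, y = layer(x), layer(y)
    if k in (1, 20, 40, 80):
        print(f"layer {k:2d}   gap = {np.linalg.norm(x - y):.3e}")
# The gap is multiplied by roughly |1 + lambda_max| per layer instead of shrinking.
```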
Lyapunov Stability: The System Has No Energy Budget
The Most Powerful Idea in Dynamical Systems
In 1892, a Russian mathematician named Aleksandr Lyapunov came up with an idea that changed everything. You don't need to solve the equations of a system to prove it's stable. You just need to find a single scalar function of the state, something like "energy" or "distance from equilibrium," and show it can only decrease over time. If such a function exists, the system must be converging. It has no choice.
Formally, a Lyapunov function for a discrete system \(x_{k+1} = T(x_k)\) is a function \(V : \mathbb{R}^d \to \mathbb{R}_{\geq 0}\) satisfying:

\[
V(x^*) = 0, \qquad V(x) > 0 \ \text{for } x \neq x^*, \qquad V(T(x)) \leq V(x) \ \text{for all } x
\]
The third condition is everything. The "energy" \(V\) cannot increase from step to step. If it strictly decreases everywhere except the equilibrium point, the system is asymptotically stable: it will converge, no matter where it starts, no matter what you throw at it.
Picture a marble resting in a bowl. The bowl is the Lyapunov function made physical. Potential energy is the function \(V\). Gravity ensures \(V\) can only decrease. No matter where you place the marble inside the bowl, it rolls toward the bottom. It cannot roll up the side and escape. The bowl's geometry enforces convergence completely, without you doing any work. A stable memory system needs this bowl-like geometry in its state space. The Transformer's state space is flat. There is no bowl. There is not even a rim.
The Calculation That Breaks Everything
Let's try to find a Lyapunov function for the Transformer. Start with the most natural candidate: the squared norm of the hidden state, \(V(x) = \|x\|^2\). This would represent the total "energy" of the representation. For the Transformer to be Lyapunov stable, this must be non-increasing.
Compute what happens to it after one Transformer layer:

\[
V(x_{k+1}) = \|x_{k+1}\|^2 = \|x_k + f_k(x_k)\|^2
\]

Expand using \(\|a + b\|^2 = \|a\|^2 + 2\langle a, b \rangle + \|b\|^2\):

\[
V(x_{k+1}) \;=\; \underbrace{\|x_k\|^2}_{V(x_k)} \;+\; \underbrace{2\langle x_k, f_k(x_k)\rangle}_{\text{cross term}} \;+\; \underbrace{\|f_k(x_k)\|^2}_{\geq\,0}
\]
Stop at that last term. \(\|f_k(x_k)\|^2\) is a squared norm. It is always non-negative. Not sometimes. Not in bad cases. Always. It doesn't matter what the weights are. It doesn't matter what the input is. Every single layer, this term injects energy into the system unconditionally.
For the total energy to decrease, the cross term \(2\langle x_k, f_k(x_k)\rangle\) would have to be sufficiently negative. That means \(f_k(x_k)\) would have to point opposite to \(x_k\), aggressively fighting against the current state.
Think about what \(f_k(x_k)\) actually does. It gathers semantically related context and adds it. It amplifies signals the model finds important. It is adding relevant information that points in the same direction as the current state. That inner product is almost always positive. The attention mechanism is literally designed to make it positive.
There is no Lyapunov function. Not because the analysis gets complicated. Because the architecture structurally injects energy at every layer with nothing to remove it.
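You can watch the injection happen with a toy layer (random weights, a tanh standing in for the learned block; nothing here is a trained model). The \(\|f(x)\|^2\) column is never negative, and the total energy trends upward:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
W = rng.standard_normal((d, d)) / np.sqrt(d)    # toy layer weights, not a trained Transformer

def f(x):
    return np.tanh(W @ x)

x = rng.standard_normal(d)
for k in range(12):
    fx = f(x)
    cross  = 2.0 * np.dot(x, fx)     # 2<x, f(x)>: would have to be very negative to save us
    inject = np.dot(fx, fx)          # ||f(x)||^2: always >= 0, injected at every layer
    x = x + fx
    print(f"layer {k:2d}   V = {np.dot(x, x):9.2f}   cross = {cross:8.2f}   inject = {inject:6.2f}")
```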
What About Layer Normalization?
This is the obvious objection. Layer Norm is in every Transformer. It recenters and rescales the hidden state to a fixed scale. Doesn't that fix the energy problem?
No. And the reason is important.
Layer Norm is a scalar operation. It changes the length of the vector. It says nothing about the direction. Two hidden states that are geometrically drifting apart in opposite directions of the \(d\)-dimensional semantic space are not brought closer together by rescaling either of them. You're just making two diverging arrows shorter. They are still diverging.
The Lyapunov violation is a disease of direction. Layer Norm is a patch for magnitude. They are not the same problem, and one does not fix the other.
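A quick check of that claim, using a plain LayerNorm without the learned scale and shift (a simplification): normalizing two states changes their lengths, but the angle between them, which is where the divergence lives, barely moves.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm without the learned gain/bias: recenter, then rescale to unit variance.
    return (x - x.mean()) / (x.std() + eps)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(3)
d = 512
x = rng.standard_normal(d)
y = 5.0 * rng.standard_normal(d)     # a second state, drifting off in its own direction

print(f"cosine before LayerNorm: {cosine(x, y):+.4f}")
print(f"cosine after  LayerNorm: {cosine(layer_norm(x), layer_norm(y)):+.4f}")
# The magnitudes change; the directional disagreement is essentially untouched.
```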
Spectral Instability: Some Directions Just Explode
Reading the Eigenvalues
The previous two failures were about the global picture. No contraction, no Lyapunov function. But spectral analysis lets us go further and identify the specific directions in state space where things blow up.
Near any particular hidden state \(x_k\), linearize the Transformer update using its Jacobian \(J_{f_k} = \frac{\partial f_k}{\partial x}\big|_{x_k}\). The local behavior of the update looks like:

\[
\delta x_{k+1} \;\approx\; (I + J_{f_k})\,\delta x_k
\]

where \(\delta x_k\) is a small perturbation of the hidden state. The effective transition matrix is \((I + J_{f_k})\). Its eigenvalues are:

\[
\mu_i = 1 + \lambda_i,
\]

where the \(\lambda_i\) are the eigenvalues of \(J_{f_k}\). For a discrete system \(x_{k+1} = M x_k\) to be stable, all eigenvalues of \(M\) must lie strictly inside the unit circle:

\[
|\mu_i| < 1 \quad \text{for all } i
\]
For the Transformer, satisfying this would require every eigenvalue of \(J_{f_k}\) to fall inside a disk of radius 1 centered at \(-1\) in the complex plane. Think about what that constraint means geometrically. You're asking the Jacobian of the learned transformation to have a very specific and tightly restricted spectrum. Not just "don't be too big." Centered at negative one.
Nothing enforces this. Not the architecture. Not the training loss. Predicting the next token accurately has essentially nothing to do with having a spectrally constrained Jacobian. GPT-5 was not trained to satisfy the unit circle criterion. It was trained to be good at language. Those are very different objectives, and optimizing one tells you almost nothing about the other.
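Here is what "nothing enforces this" looks like numerically. The Jacobian below is just a random matrix; no trained model is involved. The point is that a generic \(J\) puts plenty of eigenvalues of \(I + J\) outside the unit circle, and nothing in the next-token objective pushes them back in.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 256
J = rng.standard_normal((d, d)) / np.sqrt(d)    # random stand-in for a layer's Jacobian

mu = np.linalg.eigvals(np.eye(d) + J)           # eigenvalues of the effective transition matrix
outside = np.abs(mu) > 1.0

print(f"{outside.sum()} of {d} eigenvalues of (I + J) lie outside the unit circle")
print(f"largest |mu| = {np.abs(mu).max():.3f}; "
      f"after 80 layers that mode is amplified ~{np.abs(mu).max() ** 80:.1e}x")
```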
What Happens When an Eigenvalue Escapes
When \(|\mu_i| > 1\) for some eigendirection, the hidden state component along that direction gets multiplied by \(|\mu_i|\) at every layer. After \(L\) layers, it has grown by \(|\mu_i|^L\). This is exponential amplification. And the LLM has no mechanism to stop it.
This directly causes three things you've definitely seen in production:
Hallucination that can't be corrected mid-conversation. A hallucinated fact enters the residual stream with a non-trivial component along an unstable eigendirection. By the time the model finishes generating that sentence, the hallucinated fact has been amplified many times. Contradictory information that arrives later has to fight exponential amplification to update the state. It loses. The model doubles down.
Disproportionate sensitivity to early context. The first tokens set eigendirections early. Those directions get amplified at every layer. By layer 80, the initial semantic push has been multiplied by \(|\mu_i|^{80}\). Later context can barely move the final state. This is why prompt phrasing matters so absurdly much. It's not about what the model "understands." It's about which eigenmodes get activated first.
Getting worse on long-context tasks. The deeper the model, the more layers available to amplify unstable modes. Very deep models don't just add more computation. They add more amplification of whatever structural instabilities exist. Depth makes the spectral problem worse, not better.
These three failure modes look completely different on the surface. Different symptoms, different use cases, different user complaints. But they have the same root cause: eigenvalues outside the unit circle, amplifying without bound, with nothing to stop them.
Dissipativity and Bounded Evolution: What the Real World Does Differently
To appreciate how badly unconstrained the Transformer is, you need to see what a properly constrained memory system looks like. Real physical and engineered systems don't stabilize by accident. The stability is structural. Someone put it there.
The Linear State-Space System
The canonical model for memory in physics and engineering is the linear state-space system:

\[
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)
\]

The matrix \(A\) is the system matrix. Everything about the system's stability lives in \(A\). For memory to fade when input stops, which is the only sane behavior, all eigenvalues of \(A\) must have strictly negative real parts:

\[
\operatorname{Re}\bigl(\lambda_i(A)\bigr) < 0 \quad \text{for all } i
\]
In a control system, an engineer picks \(A\) to satisfy this before writing a single line of code. In biological neural circuits, evolution selects for it. The spectral property of \(A\) is not an emergent property of running the system for a long time on lots of data. It is a design requirement. It has to be true from the start, or the system will blow up.
Dissipation Is the Mechanism, Not the Bug
Physical memory systems lose energy. Not because they're imperfect but because losing energy is what makes them stable. This is called dissipativity, and a dissipative system satisfies:

\[
\dot{V}(x) \;\leq\; -\alpha\,V(x) + \beta\,\|u\|^2
\]
The term \(-\alpha V(x)\) is natural decay. Without any input \(u\), energy decreases exponentially at rate \(\alpha\). The term \(\beta\|u\|^2\) is the energy coming in from external input. Memory is sustained only when it's actively being driven. When the input stops, the memory fades.
This sounds like a weakness. It isn't. A memory system that retains everything forever with no input isn't stable. It's saturated. It can't update. Old information blocks new information. Forgetting is not a flaw in physical memory systems. It's the structural feature that makes the whole thing work.
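A one-dimensional sketch makes the behavior concrete. The numbers below are arbitrary; the only structural ingredient is the positive decay rate (the scalar version of the \(-\alpha V\) term).

```python
import numpy as np

# Leaky scalar memory: x'(t) = -a x(t) + u(t), stepped exactly with step dt
# for piecewise-constant input. All numbers are illustrative.
a, dt = 0.5, 0.1
decay = np.exp(-a * dt)                       # |decay| < 1 because a > 0

x, trace = 0.0, []
for t in range(200):
    u = 1.0 if t < 50 else 0.0                # drive the memory, then cut the input
    x = decay * x + (1.0 - decay) / a * u     # exact update over one step
    trace.append(x)

print(f"while driven:                x = {trace[49]:.3f}")
print(f"100 steps after input stops: x = {trace[149]:.6f}")   # the memory has faded
```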
Physical Memory System
Memory decays unless actively driven. Spectrum is designed for selective retention. Energy is bounded by input power. Dissipation enforces stability. Forgetting is structural and guaranteed by the mathematics.
Transformer Memory
Memory accumulates without bound. No spectral design. Energy grows with depth. No dissipation mechanism. Forgetting is accidental, not guaranteed by anything. Old context and new context compete without a principled resolution mechanism.
The Transformer makes no distinction between "sustain this memory because it's still relevant" and "add this new information." It just adds. Every layer. Every token. The architecture has no concept of forgetting by design. Everything that enters the residual stream stays there and gets amplified.
Bounded Evolution: The Fifth Failure
The final condition is the most concrete. A stable memory system must keep its state within a bounded region of space. States that wander off to infinity cannot represent meaningful content. They are numerical garbage.
For a physical system with stable \(A\) and bounded input, this is guaranteed by the spectral property. You get a concrete exponential bound on how far the state can wander. You can write down the constant \(C\) from first principles.
The Transformer has no such bound. Layer Norm is the only mechanism that limits state magnitude, and Layer Norm is applied within sub-components, not globally. The overall hidden state can and does grow with depth. There is no derivable constant \(C\) that holds in general. The bound, if it exists at all for a given model, is an empirical accident of training, not a structural guarantee.
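Here is the contrast in one run: a toy residual stack (random weights, a tanh block) next to a toy stable state-space recursion of the same depth, both receiving fresh input-scale perturbations. The specific matrices are arbitrary choices of mine; only the spectral structure matters.

```python
import numpy as np

rng = np.random.default_rng(5)
d, depth = 64, 96

# Toy residual stack: x <- x + tanh(W x). Not a trained model.
W = rng.standard_normal((d, d)) / np.sqrt(d)
x_res = rng.standard_normal(d)

# Toy stable SSM: x <- Abar x + dt * u, with Abar = e^{A dt} and A diagonal, negative.
A_diag = -np.arange(1.0, d + 1.0)             # eigenvalues -1, -2, ..., -d
dt = 0.1
Abar = np.diag(np.exp(A_diag * dt))           # every eigenvalue has magnitude < 1
x_ssm = rng.standard_normal(d)

for _ in range(depth):
    u = rng.standard_normal(d)                # fresh input at every step
    x_res = x_res + np.tanh(W @ x_res)
    x_ssm = Abar @ x_ssm + dt * u             # crude input injection (Bbar ~ dt * I)

print(f"residual stack after {depth} layers: ||x|| = {np.linalg.norm(x_res):8.1f}")
print(f"stable SSM after {depth} steps:      ||x|| = {np.linalg.norm(x_ssm):8.3f}")
```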
How S4 Actually Solves This
The five failures of the Transformer are not just theoretical observations. They directly motivated a new class of architectures: Structured State Space Models. S4, then S5, then Mamba. These models were built by researchers who took the dynamical systems requirements seriously first and designed the architecture to satisfy them from the ground up.
Start With a Stable Matrix
S4 begins with the continuous linear state-space system. Same equations we wrote above. The key is the choice of \(A\). S4 uses the HiPPO matrix, which stands for High-order Polynomial Projection Operator. It is a carefully constructed matrix under which the hidden state continuously maintains an optimal approximation of the input history in a basis of orthogonal polynomials.
Crucially, the HiPPO matrix is designed so that all its eigenvalues have strictly negative real parts. This is not a training outcome. This is a mathematical property of the matrix that researchers proved analytically before any training happened.
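A sketch of that claim, written from the published HiPPO-LegS formula (treat the exact entries as my transcription, not the S4 reference implementation): the state matrix is lower triangular, so its eigenvalues can be read off the diagonal, and every one of them is strictly negative.

```python
import numpy as np

def hippo_legs(N):
    # HiPPO-LegS state matrix, transcribed from the published formula.
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = n + 1
    return -A                          # sign convention: the ODE is x'(t) = A x(t) + B u(t)

A = hippo_legs(64)
# A is lower triangular, so its eigenvalues are exactly its diagonal entries.
print(np.diag(A))                      # -1, -2, ..., -64: every real part strictly negative
print(f"max eigenvalue real part: {np.diag(A).max():.1f}")
```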
Discretize Correctly
To apply S4 to discrete sequences, you need to discretize the continuous system. S4 uses the matrix exponential, not Euler:

\[
x_{k+1} = \bar{A}\,x_k + \bar{B}\,u_k, \qquad \bar{A} = e^{A\Delta t}, \quad \bar{B} = A^{-1}\bigl(e^{A\Delta t} - I\bigr)B
\]
Now compare the two approaches side by side:
Transformer · Forward Euler
Transition matrix is \(I + J_{f_k}\). Eigenvalues are \(1 + \lambda_i\). No constraint keeps them inside the unit circle. Stability is not enforced anywhere.
S4 · Matrix Exponential
Transition matrix is \(e^{A\Delta t}\). Because \(\text{Re}(\lambda_i(A)) < 0\), all eigenvalues of \(e^{A\Delta t}\) satisfy \(|\cdot| < 1\). Stability is structural.
The reason this works is a clean piece of mathematics. When \(A\) has eigenvalues \(\lambda_i\) with negative real parts:

\[
\bigl|\lambda_i\bigl(e^{A\Delta t}\bigr)\bigr| \;=\; \bigl|e^{\lambda_i \Delta t}\bigr| \;=\; e^{\operatorname{Re}(\lambda_i)\,\Delta t} \;<\; 1
\]
No matter what \(\Delta t\) you pick, as long as \(A\) is stable, every eigenvalue of \(\bar{A}\) has magnitude strictly less than 1. The S4 update is a genuine contraction. It has a Lyapunov function. It has spectral stability. It is dissipative. It maintains bounded evolution. It satisfies all five conditions the Transformer violates.
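You can verify the contrast directly. The \(A\) below is an arbitrary stable matrix of my own construction (a skew-symmetric part plus a negative shift, so every eigenvalue has real part \(-1\)); no claim about any particular trained SSM. Forward Euler can leave the unit circle even at a modest \(\Delta t\); the matrix exponential never does.

```python
import numpy as np
from scipy.linalg import expm

def spectral_radius(T):
    return np.abs(np.linalg.eigvals(T)).max()

rng = np.random.default_rng(6)
d = 32
M = rng.standard_normal((d, d))
A = (M - M.T) / 2.0 - np.eye(d)        # skew-symmetric part minus I: every Re(eig) = -1

for dt in (0.1, 1.0, 3.0):
    euler = np.eye(d) + dt * A         # Forward Euler transition matrix
    exact = expm(dt * A)               # matrix exponential transition matrix
    print(f"dt = {dt:3.1f}   rho(I + dt*A) = {spectral_radius(euler):6.2f}   "
          f"rho(exp(A dt)) = {spectral_radius(exact):.3f}")
```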
This is not luck. The S4 researchers sat down with the mathematics of dynamical systems, wrote down the requirements, and built an architecture that satisfies them. That's the entire point. You can design stability in, or you can leave it out and wonder why your model keeps breaking.
Five Conditions. Zero Satisfied.
Let's put the complete picture in one place. These are the five conditions. Any system claiming to be a reliable memory system needs all of them. The Transformer satisfies zero of them by structural design. S4 satisfies all five by structural design.
| Condition | Mathematical Requirement | Transformer | S4 / SSM |
|---|---|---|---|
| Contraction | \(\|T(x)-T(y)\|\leq\alpha\|x-y\|,\;\alpha<1\) | ✗ Not enforced | ✓ Structural |
| Lyapunov Stability | \(\exists\,V:\;V(T(x))\leq V(x)\) | ✗ Norm always grows | ✓ By design |
| Spectral Stability | \(|\mu_i|<1\) for all eigenvalues | ✗ Unconstrained | ✓ By construction |
| Dissipativity | \(\dot{V}\leq-\alpha V+\beta\|u\|^2\) | ✗ No decay term | ✓ Via stable A |
| Bounded Evolution | \(\|x_k\|\leq C\) for all \(k\) | ✗ LayerNorm is a patch | ✓ Exponential bound |
But It Works. How?
At this point you are probably thinking: if all five conditions fail, why does GPT-5 work at all? It clearly produces useful outputs. It clearly doesn't explode most of the time. Something must be keeping it together.
You're right. Three things partially save it in practice, and it's worth being honest about what they are and what they aren't:
- Finite depth puts a ceiling on the damage. Unstable eigenmodes can only get amplified as many times as there are layers. A 96-layer model amplifies things 96 times, not infinitely. The instability is bounded by the architecture. But it is still there, it grows with depth, and "bounded by 96 layers of amplification" is not the same as "stable."
- Training data has enormous structure. Human language is deeply patterned. The optimization process accidentally induces approximate stability in the most common linguistic regimes because the training distribution forces it. The Jacobians of the learned functions happen to have mostly well-behaved eigenvalues on the training distribution. But this is emergent and fragile, not structural and guaranteed. Move off-distribution and it falls apart.
- Softmax provides partial damping. Attention weights sum to 1. This limits how much any single token can dominate the attention output. It is a partial numerical brake. It does not enforce contraction. It does not create a Lyapunov function. But it does prevent the most extreme forms of single-direction explosion in the attention mechanism specifically (sketched just below this list).
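As a sketch of what that brake does and doesn't buy you (toy numbers, a single head, no projections): the weights form a convex combination, so the attention output can never exceed the largest value vector, but nothing about this bounds what the residual stream accumulates afterwards.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(7)
scores = 5.0 * rng.standard_normal(16)    # toy attention scores over 16 tokens
values = rng.standard_normal((16, 64))    # toy value vectors

w = softmax(scores)
out = w @ values                          # attention output: a convex combination of values

print(f"weights sum to {w.sum():.6f}")
print(f"||output|| = {np.linalg.norm(out):.2f}  <=  max ||value|| = "
      f"{np.linalg.norm(values, axis=1).max():.2f}")
```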
These three things together mean the Transformer mostly works within a well-defined comfortable regime. Reasonable prompts, reasonable context lengths, tasks similar to training data. Within that regime, the structural failures are present but often not severe enough to manifest obviously.
Push outside it and the mathematics takes over. Longer contexts. Adversarial prompts. Tasks requiring genuine multi-step memory. Novel domains. Novel reasoning structures. Anywhere the training distribution patches stop covering the structural failures, you get the failures. Hallucinations that compound instead of correcting. Context that drops after a few thousand tokens. Confident wrong answers that double down under questioning.
These are not bugs that will get fixed with more data or more parameters. They are theorems. They are consequences of the mathematical structure of the architecture. They will keep happening as long as the architecture violates these five conditions. More scale, and in particular more depth, makes them worse, not better.
I want to be precise about what this chapter is and isn't saying. It isn't saying the Transformer is useless. It clearly isn't. It's saying that its failure modes are mathematically inevitable, and that understanding the math is the only way to actually fix them rather than patching them with more compute and hoping for the best.