AI is Dumb. Here's the Proof.
When people talk about modern large language models (GPT-4, Claude, Gemini, or heavyweights like GPT-5), they usually talk about scale. They obsess over billions of parameters, trillions of tokens, the "secret sauce" of Reinforcement Learning from Human Feedback (RLHF), and the massive GPU clusters humming in the desert.
Almost nobody talks about the equation.
This is my crazy work behind the blogs: digging into the actual mathematics powering every LLM you've ever used, and finding what nobody wants to say out loud.
But every one of these systems runs on the same fundamental update rule. This rule isn't just a clever neural network trick. It is a dynamical system. Specifically, it is a very primitive, very stubborn numerical integrator.
If you want to understand why LLMs drift, why they overcommit, why they double down on wrong interpretations, and why they sometimes hallucinate with the absolute confidence of a person who has never been wrong in their life, you have to look at the math beneath the hood.
A Dynamical System Is Just Memory in Motion
Forget AI for a second. Imagine a physical system: a swinging pendulum, a cooling cup of coffee, or a car coasting on a highway.
In each case, there is a state \(x\): position, velocity, temperature — and that state changes over time according to a rule. Mathematically, a continuous dynamical system looks like this:
\[
\frac{dx}{dt} = f\big(x(t)\big)
\]
The state \(x(t)\) evolves smoothly. The function \(f\) is the "engine." It tells you how the system changes at every nanosecond.
If you nudge a stable system (like a marble in a bowl), it settles back down. If you nudge an unstable system (like a pencil balanced on its tip), the disturbance grows. Stability is the art of ensuring small errors don't blow up into catastrophes.
Now, think about memory. Memory isn't a static filing cabinet. It's a dynamical system. When you read a sentence, your brain doesn't just "store" words; it evolves an internal "state" of understanding.
Evidence nudges your belief; context dampens your doubts.
A well-behaved memory system moves continuously. It doesn't teleport from "I think it's a cat" to "IT IS DEFINITELY A NUCLEAR SUBMARINE" in a single microsecond.
The Discipline of Continuous Time
Take a very simple stable system:
\[
\frac{dx}{dt} = -5x
\]
This system is "disciplined." Whatever the value of \(x\), it decays smoothly toward zero. The analytical solution is:
\[
x(t) = x(0)\,e^{-5t}
\]
If you hit this system with a hammer (a perturbation), the energy dissipates. It has built-in damping. No matter how hard you push it, the system always finds its way back to zero.
But computers are digital. They can't do "continuous." They have to take steps.
Enter: The Forward Euler Method
The simplest way to approximate a continuous system on a computer is the Forward Euler method. You take the current state, look at the derivative, and jump forward by a "step size" \(h\):
\[
x_{k+1} = x_k + h \cdot f(x_k)
\]
It's the most basic integration rule in existence. It's also the most dangerous.
When Euler Goes Rogue
Let's apply Euler to our stable system \(dx/dt = -5x\) with a step size of \(h = 1\):
\[
x_{k+1} = x_k + 1 \cdot (-5x_k) = -4\,x_k
\]
Starting from \(x_0 = 1\), the trajectory is \(1 \to -4 \to 16 \to -64 \to 256 \to \cdots\)
Instead of decaying toward zero, the system explodes. It flips from positive to negative, growing 4× every single step. A perfectly stable physical reality becomes a chaotic nightmare because the "step" was too big for the "engine."
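If you want to watch this happen, here is a minimal sketch in plain Python (no libraries; the function name `euler` is just illustrative). It integrates \(dx/dt = -5x\) with Forward Euler at two step sizes: the small step decays like the true solution, while \(h = 1\) flips sign and blows up.

```python
# Forward Euler on dx/dt = -5x. The true solution x(t) = x(0)*exp(-5t) decays to zero.
def euler(x0, h, steps, lam=-5.0):
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = x + h * (lam * x)   # x_{k+1} = x_k + h * f(x_k)
        trajectory.append(x)
    return trajectory

print(euler(1.0, h=0.1, steps=10))  # shrinks toward 0: 1.0, 0.5, 0.25, 0.125, ...
print(euler(1.0, h=1.0, steps=10))  # explodes: 1, -4, 16, -64, 256, ...
```

Same engine, same physics; the only thing that changed is the step size.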
The Three Fates of Memory
To understand how memory behaves in this discrete world, we have to look at the growth factor. Applying Forward Euler to the linear test system \(dx/dt = \lambda x\) gives the discrete update \(x_{k+1} = \Phi\, x_k\) with \(\Phi = 1 + h\lambda\), and everything hinges on the magnitude \(|\Phi| = |1 + h\lambda|\). There are exactly three mathematical scenarios for the "memory" of the system:
- \(|1 + h\lambda| < 1\): Forgetting. Each step shrinks the state, so perturbations decay and the memory settles back toward equilibrium.
- \(\lambda = 0\) (so \(|1 + h\lambda| = 1\)): Perfect memory. The state is carried forward unchanged; nothing decays, nothing explodes.
- \(|1 + h\lambda| > 1\): Explosion. Each step amplifies the state, so any small error grows without bound.
The critical insight: for Euler to be stable, the step size \(h\) must be tiny enough relative to the system's dynamics. If you ignore this and fix \(h = 1\), you are gambling with stability at every single step.
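You can make "tiny enough" precise for that linear test system. Unrolling the update gives a closed form, and demanding decay pins down the admissible step size (a standard derivation, written here for a real, negative \(\lambda\)):
\[
x_k = (1 + h\lambda)^k\, x_0
\quad\Longrightarrow\quad
|x_k| \to 0 \iff |1 + h\lambda| < 1 \iff 0 < h < \frac{2}{|\lambda|}.
\]
For our example \(\lambda = -5\), that means \(h < 0.4\). The fixed choice \(h = 1\) sits far outside the stable region, which is exactly why the trajectory above explodes.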
The Transformer Is Forward Euler in Disguise
Now, let's look at the Transformer. If you strip away the fluff, a single layer of a Transformer is defined by its Residual Connection. The state \(x\) — your "thought" or hidden representation — enters a layer, goes through a transformation, and the result is added back to the original state.
The equation for a Transformer layer \(k\), the same one under the hood of GPT-5, is:
\[
x_{k+1} = x_k + \text{Attn}(x_k)
\]
This is exactly Forward Euler with a fixed step size of \(h = 1\). But to see why this is a curse, we have to rip the lid off that Attention engine \(\text{Attn}(x_k)\) and see how it calculates that nudge.
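Before opening up the engine, it is worth seeing the skeleton of that loop as code. This is a deliberately abstract sketch (the `layer` body is a placeholder standing in for whatever \(\text{Attn}\) and the rest compute, not a real Transformer block); the only point is that the forward pass is nothing but repeated \(x \leftarrow x + f(x)\) with no step-size control.

```python
# A stack of residual layers is a fixed-step Forward Euler integrator in disguise.
def forward(x, layers):
    for layer in layers:      # one "time step" per layer
        x = x + layer(x)      # x_{k+1} = x_k + 1 * f(x_k): no step size, no damping
    return x
```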
Engine 1: The Matchmaker (Attention)
Inside the attention block, the model isn't doing "logic." It's doing Importance Matching. Every word in the hidden state is projected into three different spaces:
Queries (Q): "What am I looking for?" For example, I'm the word Mercury, looking for Space or Chemistry context.
Keys (K): "What do I contain?" For example, I'm the word Orbit, and I carry Space information.
Values (V): "What is my actual semantic content?" The raw data about orbits that gets transferred.
To figure out how much the state should change, the model calculates a Similarity Score — it slams Queries against Keys using a dot product:
\[
\text{score} = QK^{T}
\]
If a Query ("Space") matches a Key ("Orbit"), the score is huge. This tells the model these words are semantically linked. But it needs to commit, so it turns those scores into Attention Weights using a Softmax:
\[
\text{softmax}(s)_i = \frac{e^{s_i}}{\sum_{j} e^{s_j}}
\]
Mathematically, the full Attention engine output for a sequence is:
\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V
\]
This is pure word importance matching. If "Orbit" gets a 0.9 weight, the model decides it is the most important context. It then multiplies that weight by the Value \(V\) to produce the final nudge.
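To make the "commit" step concrete, here is a small worked example with made-up scores (the numbers are illustrative, not from any real model). Suppose the Query for "Mercury" produces raw similarity scores of 4.0 against "Orbit", 1.8 against "Sun", and 0.5 against "Sodium". The softmax turns them into weights:
\[
\text{softmax}(4.0,\ 1.8,\ 0.5) = \left(\frac{e^{4.0}}{e^{4.0} + e^{1.8} + e^{0.5}},\ \ldots\right) \approx (0.876,\ 0.097,\ 0.026).
\]
A modest gap in raw scores becomes a near-monopoly on the output: the nudge handed to the residual stream is dominated by the "Orbit" Value vector.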
Engine 2: The Accelerator (FFNN)
Once Attention has gathered the context, the Feed-Forward Neural Network (FFNN) takes over. If Attention is the "Steering," the FFNN is the "Accelerator."
Mathematically, the FFNN is a point-wise non-linear transformation:
\[
\text{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2
\]
Think of the FFNN as a Specialist. It takes the fuzzy "importance" weights from the Attention engine and sharpens them. It maps the signal into a massive, higher-dimensional space (often 4× the model's hidden dimension) where it can "categorize" the thought.
If the Attention engine says "I'm 51% sure this is about a planet," the FFNN acts as an Amplifier. It applies a squashing function \(\sigma\) that pushes the signal toward a concrete semantic pole. It takes that 51% and says, "Right, then it's definitely a planet. Let's push the state even harder in that direction."
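Concretely, using the common 4× convention mentioned above (individual models vary), the two weight matrices have shapes
\[
W_1 \in \mathbb{R}^{d \times 4d}, \qquad W_2 \in \mathbb{R}^{4d \times d},
\]
so each token's \(d\)-dimensional state is blown up into a \(4d\)-dimensional space, pushed through the nonlinearity \(\sigma\), and projected back down to \(d\) dimensions before being added to the residual stream.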
The Opened-Up Update Equation
If we combine everything, the full mathematical "step" of a single layer \(k\) in a model like GPT-5 looks like this. First, the residual + attention sub-layer:
\[
x_k' = x_k + \text{MultiHead}\big(\text{LayerNorm}(x_k)\big)
\]
where the MultiHead attention is fully expanded as:
\[
\text{MultiHead}(x) = \text{Concat}\big(\text{head}_1, \ldots, \text{head}_H\big)\,W^{O},
\qquad
\text{head}_i = \text{softmax}\!\left(\frac{\big(xW_i^{Q}\big)\big(xW_i^{K}\big)^{T}}{\sqrt{d_k}}\right) xW_i^{V}
\]
Then the FFN sub-layer adds on top of that:
\[
x_{k+1} = x_k' + \text{FFN}\big(\text{LayerNorm}(x_k')\big)
\]
Putting it all together, the complete opened-up single-layer update of a Transformer, the actual equation running inside GPT-5, is:
\[
x_{k+1} \;=\; x_k
\;+\; \text{Concat}\big(\text{head}_1, \ldots, \text{head}_H\big)\,W^{O}
\;+\; W_2\,\sigma\!\big(W_1\, x_k^{\prime\,\text{LN}} + b_1\big) + b_2,
\qquad
\text{head}_i = \text{softmax}\!\left(\frac{\big(x_k^{\text{LN}} W_i^{Q}\big)\big(x_k^{\text{LN}} W_i^{K}\big)^{T}}{\sqrt{d_k}}\right) x_k^{\text{LN}} W_i^{V}
\]
where \(x_k^{\text{LN}} = \text{LayerNorm}(x_k)\) and \(x_k^{\prime\,\text{LN}} = \text{LayerNorm}(x_k + \text{MultiHead}(x_k^{\text{LN}}))\)
Comparing this to the Euler formula \(x_{k+1} = x_k + h \cdot f(x_k)\), the model is forced into a fixed step size \(h = 1\). Every matrix multiply, every softmax, every nonlinearity — the entire result gets injected into the memory stream at full force:
\[
x_{k+1} = x_k + 1 \cdot \Big(\text{MultiHead}\big(x_k^{\text{LN}}\big) + \text{FFN}\big(x_k^{\prime\,\text{LN}}\big)\Big)
\]
Sorry for the maths. This single equation is the reason for almost every LLM failure mode you've ever witnessed. Worth the pain.
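For readers who prefer code to equations, here is a minimal NumPy sketch of that opened-up update: a single attention head, no masking, no dropout, toy dimensions, illustrative names, and the pre-LN arrangement used in the formula above. It is a sketch of the structure, not a faithful reimplementation of any particular model. The two `x + ...` lines are the two fixed-step Euler updates.

```python
import numpy as np

d, d_ff, n = 8, 32, 4                          # hidden size, FFN size, sequence length (toy)
rng = np.random.default_rng(0)
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
W1, W2 = rng.standard_normal((d, d_ff)) * 0.1, rng.standard_normal((d_ff, d)) * 0.1

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(s):
    s = s - s.max(-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(-1, keepdims=True)

def attention(x):                              # single head: softmax(QK^T / sqrt(d)) V W^O
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V @ Wo

def ffn(x):                                    # W2 * relu(W1 x), biases omitted for brevity
    return np.maximum(x @ W1, 0.0) @ W2

def transformer_layer(x):
    x = x + attention(layer_norm(x))           # Euler step 1: x' = x + MultiHead(LN(x))
    x = x + ffn(layer_norm(x))                 # Euler step 2: x_{k+1} = x' + FFN(LN(x'))
    return x

x = rng.standard_normal((n, d))                # a tiny "thought": n tokens of width d
print(transformer_layer(x).shape)              # (4, 8): same shape, handed to the next layer
```

Notice that nothing in this code scales the nudge down or pulls the state back toward where it started; the result of each sub-layer is added in whole.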
The Problem: Bias by Design
Notice something? This math is just a "dating app" for vectors. Nothing in this equation stops the system from picking up a bias.
- The \(Q \cdot K^T\) score only cares about what the model thinks is relevant based on training data.
- If the training data slightly favors "Planet" over "Element," the score for "Planet" will always be higher.
- Crucially, there is no "Fact Check" or "Stabilizer" step. The attention mechanism just says, "This looks important!" and hands that vector directly to the Euler update rule.
The Two Engines: Working Together Toward Disaster
In a Transformer, these two engines work together to produce that "nudge" \(f_k\):
- The Attention Mechanism (The Steering): It gathers context using the \(Q,K,V\) math above. It looks at all other tokens and decides the direction the thought should move.
- The FFNN (The Acceleration): Once Attention has gathered the context, the FFNN processes it through a high-dimensional "intuition engine" that pushes the state further along a semantic path.
Every single layer, the model takes that nudge and performs a full Euler step. There is no brake. There is no damping. There is only the step.
Why This Leads to Hallucination
If you linearize this update rule, the stability depends on the eigenvalues of the system. For the state to stay "sane," the eigenvalues \(\lambda\) must satisfy:
\[
|1 + h\lambda| \le 1, \qquad \text{which with the Transformer's fixed } h = 1 \text{ becomes } |1 + \lambda| \le 1.
\]
But LLMs aren't trained for stability. They are trained for accuracy. If even one direction in the linearized dynamics has an eigenvalue with \(|1 + \lambda| > 1\), the state along that direction will amplify exponentially. Nobody is enforcing the stability condition during training, so these unstable eigenvalues are everywhere.
The "Mercury" Example
Imagine the prompt: "Explain why Mercury is important."
- Layers 1–10: The state is neutral, but the attention mechanism finds a tiny bit more "space" context than "chemistry" context.
- Layer 20: Because the Euler update is \(x_{k+1} = x_k + f_k\), that tiny "Planet" nudge is added to the state. The Query for the next layer is now slightly more "Planet-biased."
- Layer 40: The "Planet" direction isn't just maintained, it's reinforced. The Similarity Score \(Q \cdot K^T\) for planet-related words grows even larger.
- Layer 80: The "Planet" signal has been amplified at every one of those 80 steps. It is now a dominant, unstoppable force.
By the time the model reaches the final layer, it isn't just "considering" that Mercury is a planet. It is convinced. If you were asking about the element, the model has overshot reality. Hallucination is iterative amplification.
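The arithmetic behind that runaway is brutally simple. With purely illustrative numbers: if the "Planet" direction starts with a tiny imbalance of 0.01 and each layer multiplies it by a growth factor of just 1.1 (that is, \(|1 + \lambda| = 1.1\) along that direction), then after 80 layers
\[
0.01 \times 1.1^{80} \approx 0.01 \times 2048 \approx 20,
\]
a signal roughly two thousand times larger than where it started. Nothing dramatic happens at any single layer; the catastrophe is the compounding.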
The Essence of the Curse
The Forward Euler method is the "junk food" of numerical integration. It's fast, cheap, and easy to parallelize, which is why we can run GPT-5. But the cost is structural:
It is conditionally stable: Sharp logic makes the state explode. If the eigenvalues of the Jacobian of \(f_k\) push \(|1+\lambda|\) above 1 in any direction, that direction amplifies without bound.
It overshoots: The model commits to an interpretation far earlier than it should because \(h = 1\) is a massive jump. There is no cautious, small step. Only the full plunge every time.
It lacks "friction": Real physical systems have a \((1 - \gamma)\) damping term to pull errors back toward equilibrium. Transformers have no such term. They just pile nudge on top of nudge, layer after layer.
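To see what that missing friction would look like, compare the Transformer's update with a damped ("leaky") version, where \(0 < \gamma < 1\) shrinks the carried-over state at every step. This is an illustration of the idea, not a mechanism any production LLM is known to use:
\[
\underbrace{x_{k+1} = x_k + f_k(x_k)}_{\text{Transformer: no friction}}
\qquad \text{vs.} \qquad
\underbrace{x_{k+1} = (1 - \gamma)\,x_k + f_k(x_k)}_{\text{damped: old state shrinks by } (1-\gamma) \text{ each layer}}
\]
In the damped version, an error injected at layer 10 has been multiplied by \((1-\gamma)^{70}\) by layer 80 and is effectively gone. In the Transformer version it is still there, plus everything that has been piled on top of it.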