Chapter 01 The Broken Math of Modern AI

The Curse of Euler

Why the world's most advanced AI runs on the crudest math known to man

Everyone talks about billions of parameters. Nobody talks about the equation. Underneath GPT-4, Claude, and every modern LLM is a single update rule. It is structurally broken in ways that explain almost every failure you've ever seen.


AI is Dumb. Here's the Proof.

When people talk about modern large language models (GPT-4, Claude, Gemini, or heavyweights like GPT-5), they usually talk about scale. They obsess over billions of parameters, trillions of tokens, the "secret sauce" of Reinforcement Learning from Human Feedback (RLHF), and the massive GPU clusters humming in the desert.

Almost nobody talks about the equation.

🧠

This is the crazy work behind the blog: digging into the actual mathematics powering every LLM you've ever used, and finding what nobody wants to say out loud.

But every one of these systems runs on the same fundamental update rule. This rule isn't just a clever neural network trick. It is a dynamical system. Specifically, it is a very primitive, very stubborn numerical integrator.

If you want to understand why LLMs drift, why they overcommit, why they double down on wrong interpretations, and why they sometimes hallucinate with the absolute confidence of a person who has never been wrong in their life, you have to look at the math beneath the hood.

§ 1

A Dynamical System Is Just Memory in Motion

Forget AI for a second. Imagine a physical system: a swinging pendulum, a cooling cup of coffee, or a car coasting on a highway.

In each case, there is a state \(x\): position, velocity, temperature — and that state changes over time according to a rule. Mathematically, a continuous dynamical system looks like this:

Continuous Dynamical System
$$\frac{dx}{dt} = f(x)$$

The state \(x(t)\) evolves smoothly. The function \(f\) is the "engine." It tells you how the system changes at every nanosecond.

If you nudge a stable system (like a marble in a bowl), it settles back down. If you nudge an unstable system (like a pencil balanced on its tip), the disturbance grows. Stability is the art of ensuring small errors don't blow up into catastrophes.

Now, think about memory. Memory isn't a static filing cabinet. It's a dynamical system. When you read a sentence, your brain doesn't just "store" words; it evolves an internal "state" of understanding.

Evidence nudges your belief; context dampens your doubts.

A well-behaved memory system moves continuously. It doesn't teleport from "I think it's a cat" to "IT IS DEFINITELY A NUCLEAR SUBMARINE" in a single microsecond.

§ 2

The Discipline of Continuous Time

Take a very simple stable system:

A Perfectly Stable Continuous System
$$\frac{dx}{dt} = -5x$$

This system is "disciplined." Whatever the value of \(x\), it decays smoothly toward zero. The analytical solution is:

Analytical Solution · Pure Exponential Decay
$$x(t) = x_0\,e^{-5t}$$

If you hit this system with a hammer (a perturbation), the energy dissipates. It has built-in damping. No matter how hard you push it, the system always finds its way back to zero.

But computers are digital. They can't do "continuous." They have to take steps.

§ 3

Enter: The Forward Euler Method

The simplest way to approximate a continuous system on a computer is the Forward Euler method. You take the current state, look at the derivative, and jump forward by a "step size" \(h\):

Forward Euler Update Rule
$$x_{k+1} = x_k + h \cdot f(x_k)$$

It's the most basic integration rule in existence. It's also the most dangerous.

When Euler Goes Rogue

Let's apply Euler to our stable system \(dx/dt = -5x\) with a step size of \(h = 1\):

Euler Applied · h = 1, System = −5x
$$x_{k+1} = x_k + (1)(-5x_k) = -4x_k \implies x_k = (-4)^k x_0$$

Instead of decaying toward zero, the system explodes. It flips from positive to negative, growing 4× every single step. A perfectly stable physical reality becomes a chaotic nightmare because the "step" was too big for the "engine."

step 0   x =   1.000   // initial state
step 1   x =  -4.000   // flipped sign already
step 2   x =  16.000   // growing, wrong direction
step 3   x = -64.000   // catastrophic explosion
truth    x =  ~0.000   // what actually should happen
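The blow-up is easy to reproduce. Here is a minimal Python sketch of the Forward Euler update applied to \(dx/dt = -5x\), contrasting the reckless step \(h = 1\) with a cautious \(h = 0.1\):

```python
# Forward Euler on the stable system dx/dt = -5x.
# The true solution decays smoothly to zero, but the discrete update
# x_{k+1} = x_k + h * (-5 * x_k) multiplies the state by (1 - 5h) each step.
def euler(x0, lam, h, steps):
    x, history = x0, [x0]
    for _ in range(steps):
        x = x + h * lam * x  # the Euler step: state + step_size * derivative
        history.append(x)
    return history

unstable = euler(1.0, -5.0, h=1.0, steps=3)  # factor -4 per step: explodes
stable = euler(1.0, -5.0, h=0.1, steps=3)    # factor 0.5 per step: decays
```

With \(h = 1\) the trajectory is 1, −4, 16, −64; with \(h = 0.1\) the state halves every step, faithfully tracking the exponential decay.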

The Three Fates of Memory

To understand how memory behaves in this discrete world, we have to look at the growth factor \(|\,1 + h\lambda\,|\), call it Φ. In our discrete update \(x_{k+1} = \Phi\, x_k\), there are exactly three mathematical scenarios for the "memory" of the system:

📉
Vanishing Memory
\(|\Phi| < 1\)
\(|1 + h\lambda| < 1\)
The system forgets perturbations. Errors decay. Noise dies out. Stability.
$$\lim_{k \to \infty} x_k = 0$$
📌
Perfect Persistence
\(|\Phi| = 1\)
\(\lambda = 0\)
The state holds forever. Neither growing nor decaying. Frozen in time.
$$x_k = x_0 \;\;\forall\, k$$
💥
Exploding Memory
\(|\Phi| > 1\)
\(|1 + h\lambda| > 1\)
The system amplifies noise until it destroys the signal. Instability.
$$\lim_{k \to \infty} |x_k| = \infty$$

The critical insight: for Euler to be stable, the step size \(h\) must be tiny enough relative to the system's dynamics. If you ignore this and fix \(h = 1\), you are gambling with stability at every single step.
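The condition is easy to check numerically. A small sketch of the growth factor for the scalar test system \(dx/dt = \lambda x\):

```python
# Growth factor Phi = 1 + h*lambda for Forward Euler on dx/dt = lambda * x.
# Stability requires |Phi| < 1; for lambda < 0 that means h < 2 / |lambda|.
def growth_factor(h, lam):
    return abs(1 + h * lam)

lam = -5.0
vanishing = growth_factor(0.1, lam)   # 0.5: errors decay, stable
knife_edge = growth_factor(0.4, lam)  # 1.0: h = 2/|lam|, neither grows nor dies
exploding = growth_factor(1.0, lam)   # 4.0: the h = 1 disaster from above
```

For \(\lambda = -5\), any step larger than \(h = 0.4\) tips the system into exploding memory.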

§ 4

The Transformer Is Forward Euler in Disguise

Now, let's look at the Transformer. If you strip away the fluff, a single layer of a Transformer is defined by its Residual Connection. The state \(x\) — your "thought" or hidden representation — enters a layer, goes through a transformation, and the result is added back to the original state.

The equation for a Transformer layer \(k\), the same one under the hood of GPT-5, is:

Transformer Layer Update
$$x_{k+1} = x_k + f_k(x_k)$$

This is exactly Forward Euler with a fixed step size of \(h = 1\). But to see why this is a curse, we have to rip the lid off the nudge \(f_k\), starting with its Attention engine, and see how it calculates that update.

Engine 1: The Matchmaker (Attention)

Inside the attention block, the model isn't doing "logic." It's doing Importance Matching. Every word in the hidden state is projected into three different spaces:

The Q, K, V Decomposition

Queries (Q): "What am I looking for?" For example, I'm the word Mercury, looking for Space or Chemistry context.

Keys (K): "What do I contain?" For example, I'm the word Orbit, and I carry Space information.

Values (V): "What is my actual semantic content?" The raw data about orbits that gets transferred.

To figure out how much the state should change, the model calculates a Similarity Score — it slams Queries against Keys using a dot product:

Raw Similarity Score
$$\text{Score} = Q \cdot K^T$$

If a Query ("Space") matches a Key ("Orbit"), the score is huge. This tells the model these words are semantically linked. But it needs to commit, so it turns those scores into Attention Weights using a Softmax:

Softmax Attention Weights — d_k = dimension of vector K
$$\alpha = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right)$$

Mathematically, the full Attention engine output for a sequence is:

Full Attention Output
$$\text{Attn}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

This is pure word importance matching. If "Orbit" gets a 0.9 weight, the model decides it is the most important context. It then multiplies that weight by the Value \(V\) to produce the final nudge.
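The whole engine fits in a few lines of NumPy. In this two-token toy, the labels "Orbit" and "Sodium" are made up for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical safety
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity: Q . K^T, scaled
    weights = softmax(scores)        # commit: each row sums to 1
    return weights @ V, weights      # weighted blend of Values

# One query ("Space") against two keys: "Orbit" (space-flavored)
# and "Sodium" (chemistry-flavored). The labels are illustrative only.
Q = np.array([[1.0, 0.0]])
K = np.array([[4.0, 0.0],
              [0.0, 4.0]])
V = np.array([[1.0],
              [-1.0]])
out, w = attention(Q, K, V)
# w[0, 0] comes out near 0.94: almost the entire nudge is drawn
# from the space-flavored Value
```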

Engine 2: The Accelerator (FFNN)

Once Attention has gathered the context, the Feed-Forward Neural Network (FFNN) takes over. If Attention is the "Steering," the FFNN is the "Accelerator."

Mathematically, the FFNN is a point-wise non-linear transformation:

Feed-Forward Network
$$\text{FFN}(x) = \sigma(xW_1 + b_1)\,W_2 + b_2$$

Think of the FFNN as a Specialist. It takes the fuzzy "importance" weights from the Attention engine and sharpens them. It maps the signal into a massive, higher-dimensional space (often 4× the model's hidden dimension) where it can "categorize" the thought.

If the Attention engine says "I'm 51% sure this is about a planet," the FFNN acts as an Amplifier. It applies a squashing function \(\sigma\) that pushes the signal toward a concrete semantic pole. It takes that 51% and says, "Right, then it's definitely a planet. Let's push the state even harder in that direction."
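A toy illustration of that sharpening, applied once per layer. The gain of 10 and the six iterations are assumptions for the sketch, not values from any real model:

```python
import math

# A steep squashing function centered at 0.5: anything slightly above
# "maybe" gets pushed toward "definitely" each time it is applied.
def sharpen(p, gain=10.0):
    return 1 / (1 + math.exp(-gain * (p - 0.5)))

p = 0.51  # "51% sure this is about a planet"
for layer in range(6):
    p = sharpen(p)  # each layer re-commits, harder
# p ends near 0.99: a few layers of amplification turn a coin flip
# into near-certainty
```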

The Opened-Up Update Equation

If we combine everything, the full mathematical "step" of a single layer \(k\) in a model like GPT-5 looks like this. First, the residual + attention sub-layer:

Step 1 · Attention Sub-Layer
$$x'_k = x_k + \text{MultiHead}(\text{LN}(x_k))$$

Where the MultiHead attention is fully expanded as:

MultiHead Attention · Fully Expanded
$$\text{MultiHead}(x) = W_O \left[\text{head}_1 \;\Big\|\; \text{head}_2 \;\Big\|\; \cdots \;\Big\|\; \text{head}_h \right]$$ $$\text{head}_i = \text{softmax}\!\left(\frac{(xW_{Q_i})(xW_{K_i})^T}{\sqrt{d_k}}\right)(xW_{V_i})$$

Then the FFN sub-layer adds on top of that:

Step 2 · FFN Sub-Layer
$$x_{k+1} = x'_k + \text{FFN}(\text{LN}(x'_k))$$ $$\text{FFN}(x) = \sigma(xW_1 + b_1)\,W_2 + b_2$$

Putting it all together, the complete opened-up single-layer update of a Transformer, the actual equation running inside GPT-5, is:

The Full Opened-Up Transformer Update · One Layer of GPT-5
$$\boxed{x_{k+1} = x_k + W_O\!\left[\text{softmax}\!\left(\frac{(x_k^{\text{LN}}W_{Q_i})(x_k^{\text{LN}}W_{K_i})^T}{\sqrt{d_k}}\right)x_k^{\text{LN}}W_{V_i}\right]_i + \sigma\!\left(x_k^{\prime\,\text{LN}}W_1 + b_1\right)W_2 + b_2}$$

where \(x_k^{\text{LN}} = \text{LayerNorm}(x_k)\) and \(x_k^{\prime\,\text{LN}} = \text{LayerNorm}(x_k + \text{MultiHead}(x_k^{\text{LN}}))\)

Comparing this to the Euler formula \(x_{k+1} = x_k + h \cdot f(x_k)\), the model is forced into a fixed step size \(h = 1\). Every matrix multiply, every softmax, every nonlinearity — the entire result gets injected into the memory stream at full force:

The Euler Identity · h is Permanently Fixed at 1
$$x_{k+1} = x_k + \underbrace{1}_{h=1} \cdot \underbrace{f_k(x_k)}_{\text{all of the above}}$$
😅

Sorry for the maths. This single equation is the reason for almost every LLM failure mode you've ever witnessed. Worth the pain.
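Put back together as code, a pre-LayerNorm Transformer layer is two Euler steps with \(h = 1\). This is a toy single-head NumPy sketch with random weights (a real model learns them), just to make the structure explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Random toy weights; training would learn all of these.
Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (d, d)) for _ in range(4))
W1, b1 = rng.normal(0, 0.1, (d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(0, 0.1, (4 * d, d)), np.zeros(d)

def attn(x):  # single head, so the output projection W_O follows directly
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V @ Wo

def ffn(x):  # sigma = ReLU in this sketch
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_layer(x):
    x = x + attn(layer_norm(x))  # Euler step, h = 1: full nudge, no damping
    x = x + ffn(layer_norm(x))   # second Euler step, h = 1
    return x

x = rng.normal(size=(5, d))  # a 5-token "thought"
y = transformer_layer(x)     # same shape, nudged at full force
```

Note that nowhere in `transformer_layer` is the nudge scaled down: the additions are the fixed-step Euler update, line for line.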

§ 5

The Problem: Bias by Design

Notice something? This math is just a "dating app" for vectors. Nothing in this equation stops the system from accumulating a bias.

The Two Engines: Working Together Toward Disaster

In a Transformer, these two engines work together to produce that "nudge" \(f_k\): Attention steers by deciding which context matters, and the FFNN accelerates by amplifying whatever Attention picked.

Every single layer, the model takes that nudge and performs a full Euler step. There is no brake. There is no damping. There is only the step.

§ 6

Why This Leads to Hallucination

If you linearize this update rule, the stability depends on the eigenvalues of the system. For the state to stay "sane," the eigenvalues \(\lambda\) must satisfy:

Stability Condition · What Must Be True
$$|1 + \lambda| < 1$$

But LLMs aren't trained for stability. They are trained for accuracy. If even one direction in the linearized dynamics has \(|1 + \lambda| > 1\) (any \(\lambda > 0\) will do), the state along that direction amplifies exponentially. Nobody enforces the stability condition during training, so these unstable eigenvalues are everywhere.

The "Mercury" Example

Imagine the prompt: "Explain why Mercury is important."

Planet Bias Amplification Across Transformer Layers
Layer 1–10 · State neutral: a tiny planet lean from training · +2% planet
Layer 20 · Euler adds a nudge: the Query is now planet-biased · +18% planet
Layer 40 · Q·Kᵀ for planet-words grows, self-reinforcing · +45% planet
Layer 80 · Amplified 80×: a dominant, unstoppable signal · +90% planet

By the time the model reaches the final layer, it isn't just "considering" that Mercury is a planet. It is convinced. If you were asking about the element, the model has overshot reality. Hallucination is iterative amplification.
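As a back-of-the-envelope check on those numbers, assume one unstable direction with a per-layer gain of \(\lambda = 0.05\) (an illustrative value, not measured from any real model):

```python
# Iterative amplification along one unstable direction:
# x <- (1 + lam) * x applied once per layer.
lam = 0.05   # assumed per-layer gain in the "planet" direction
bias = 0.02  # tiny initial lean from training
for layer in range(80):
    bias *= 1 + lam
# bias is now ~0.99: a 2% lean compounded into near-certainty over 80 layers
```

A gain of just 5% per layer, invisible in any single step, is enough to turn a whisper into the loudest signal in the model.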

§ 7

The Essence of the Curse

The Forward Euler method is the "junk food" of numerical integration. It's fast, cheap, and easy to parallelize, which is why we can run GPT-5. But the cost is structural:

Three Structural Flaws

It is conditionally stable: stability holds only when the step is small enough for the dynamics. If the eigenvalues of \(f_k\) push \(|1+\lambda|\) above 1 in any direction, that direction amplifies without bound.

It overshoots: The model commits to an interpretation far earlier than it should because \(h = 1\) is a massive jump. There is no cautious, small step. Only the full plunge every time.

It lacks "friction": Real physical systems have a \((1 - \gamma)\) damping term to pull errors back toward equilibrium. Transformers have no such term. They just pile nudge on top of nudge, layer after layer.
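The missing friction is the easiest flaw to see numerically. A sketch contrasting the raw residual pile-up with a hypothetical damped update (\(\gamma = 0.3\) is an arbitrary illustration value, not a proposal from any real architecture):

```python
# The same unstable nudge (lam > 0), with and without a damping term.
lam, gamma = 0.05, 0.3

x_raw = x_damped = 1.0
for layer in range(80):
    x_raw = x_raw + lam * x_raw                         # Transformer-style: pile on
    x_damped = (1 - gamma) * x_damped + lam * x_damped  # friction pulls errors back

# x_raw has grown roughly 50x; x_damped has decayed essentially to zero
```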

"We have built the most sophisticated thinking machines in history on top of the crudest numerical solver known to man. We've given the AI a Ferrari engine but fixed the steering wheel so it can only move in discrete, 1-meter jumps. And we've removed the brakes. Everything begins here."