Foundations
No math degree required. We start with everyday analogies—recipes, spreadsheets, test scores, and blindfolded hill walking—to build your intuition for how machines learn from examples. By the end of this chapter you will understand the core loop behind every AI model.
What is AI/ML? 🔗
Imagine learning to cook from a recipe versus learning to cook by tasting hundreds of dishes and figuring out what works. Traditional programming is the recipe approach — a programmer writes out every rule by hand: "if this, then that." Machine Learning (ML) flips this — instead of writing rules, you show the computer lots of examples of inputs and correct answers, and it figures out the rules on its own. The thing the computer builds from those examples is called a model, and it is really just a formula with some adjustable knobs (called parameters) that turn inputs into outputs.
Build It
This code gives the computer five example data points and lets it discover the pattern (a straight line) on its own — no rules written by hand.
import numpy as np
# The entire idea of ML in 8 lines
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2.1, 4.0, 5.8, 8.1, 9.9])
# Closed-form solution (OLS)
X_b = np.c_[X, np.ones(len(X))] # add bias column
w = np.linalg.lstsq(X_b, y, rcond=None)[0]
print(f"y = {w[0]:.2f}x + {w[1]:.2f}") # y ≈ 2.0x + 0.1
Under the Hood
Even though the model learned on its own, the result is just two numbers — a slope and a starting point — so the entire "brain" fits in 16 bytes.
The 'model' here is two float64 numbers: a slope and an intercept — 16 bytes total. np.linalg.lstsq solves the least-squares problem with an SVD-based LAPACK routine (O(nd²)), not explicit matrix inversion (which is numerically unstable). This closed-form solution works for linear regression but doesn't scale to complex models — that's why we'll need gradient descent.
Key Takeaway
- Traditional programming is like following a recipe; ML is like learning to cook by tasting — the computer finds the rules from examples instead of being told them
- A model is just a formula with adjustable knobs — turn the knobs until the answers come out right
- Even drawing a single best-fit line through five points captures the entire idea of machine learning
Data & Features 🔗
Think of a spreadsheet where every row is one house you are looking at, and every column is a measurement — square footage, number of bedrooms, distance to the nearest school. That is exactly how a computer sees data: rows are examples, columns are measurements (called features). Raw information like text or photos first has to be converted into numbers so the computer can work with it. One last trick: if square footage ranges up to 3,000 but bedrooms only go up to 5, the big numbers would push the small ones around. So we rescale everything to a similar range — a step called feature scaling (or standardization).
Build It
This code takes a tiny spreadsheet of houses (square footage and bedrooms) and rescales the columns so no single measurement hogs the spotlight.
import numpy as np
# Feature scaling: standardization
raw = np.array([[1500, 3], [2000, 4], [1200, 2]]) # [sq_ft, bedrooms]
means = raw.mean(axis=0) # per-feature mean
stds = raw.std(axis=0) # per-feature std
X = (raw - means) / stds # scaled: mean=0, std=1
# Shape: (n_samples, n_features) = (3, 2)
print(f"Shape: {X.shape}, Means: {X.mean(axis=0)}")
Under the Hood
Without rescaling, the learning process swerves wildly because one column's numbers are a thousand times bigger than another's — like trying to steer a car where the gas pedal is a thousand times more sensitive than the brake.
The feature matrix is a contiguous block of memory — (n_samples × n_features) × 8 bytes for float64. Standardization is O(n) per feature. Without scaling, gradient descent zigzags: a feature ranging 0-1000 creates gradients 1000× larger than a feature ranging 0-1, causing inefficient optimization.
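The "1000× larger gradients" claim above can be checked directly. This is a minimal sketch with made-up data: one feature spanning roughly 0-1000, one spanning 0-1, and the MSE gradient of a linear model evaluated at zero weights.

```python
import numpy as np

# Sketch: compare per-feature gradient magnitudes with and without scaling.
rng = np.random.default_rng(0)
raw = np.column_stack([rng.uniform(0, 1000, 50),   # feature 0: big range
                       rng.uniform(0, 1, 50)])     # feature 1: small range
y = 0.003 * raw[:, 0] + 2.0 * raw[:, 1]

def mse_grad(X, y, w):
    """Gradient of MSE with respect to the weights of a linear model."""
    return 2 * X.T @ (X @ w - y) / len(y)

g_raw = mse_grad(raw, y, np.zeros(2))
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
g_scaled = mse_grad(scaled, y - y.mean(), np.zeros(2))

print(abs(g_raw[0] / g_raw[1]))        # unscaled: the big feature dominates
print(abs(g_scaled[0] / g_scaled[1]))  # scaled: comparable magnitudes
```

On the unscaled data the first feature's gradient is hundreds of times larger, so a single learning rate cannot suit both weights; after standardization the two gradients are the same order of magnitude.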
Key Takeaway
- The computer sees all data as a spreadsheet of numbers — rows are examples, columns are measurements
- Rescaling (feature scaling) puts every column on an equal playing field, like converting miles and kilometers to the same unit
- Skip this step and the learning process wobbles instead of heading straight for the answer
Linear Regression 🔗
Picture a scatter of dots on a graph — say, house sizes along the bottom and prices up the side. Now lay a ruler across those dots so it passes as close to all of them as possible. That ruler is your first model, and the technique is called linear regression. It predicts the output by multiplying each input by a weight (how much that input matters), adding them up, and tossing in one extra number called a bias (where the line crosses the zero mark). Simple as it is, it introduces the three ideas behind every AI model: adjustable numbers (parameters), making a guess (prediction), and a neat bookkeeping shortcut called the bias trick.
import numpy as np
# Linear regression prediction
def predict(x, w, b):
    return w * x + b
# Example
x = np.array([1, 2, 3, 4, 5])
w, b = 2.0, 1.0
y_hat = predict(x, w, b) # [3, 5, 7, 9, 11]
Build It
This code finds the single best-fit straight line through five data points — the computer figures out the slope and starting point all by itself.
import numpy as np
# The normal equation: w = (X^T X)^{-1} X^T y
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.0, 5.8, 8.1, 9.9])
# Bias trick: append column of 1s
X_b = np.c_[X.reshape(-1, 1), np.ones(len(X))]
# Solve (use lstsq, NOT inv — numerically stable)
w = np.linalg.lstsq(X_b, y, rcond=None)[0]
print(f"y = {w[0]:.2f}x + {w[1]:.2f}")
Under the Hood
There is a shortcut formula that finds the perfect line in one shot, but it gets painfully slow when you have lots of measurements — which is why later we will learn a step-by-step approach instead.
Matrix inversion is O(d³) where d = number of features. The bias trick folds b into the weight vector by appending a column of 1s to X, so [w, b] @ [x, 1] = w*x + b. Warning: when regularizing later (Section 12), do NOT penalize the bias column.
Key Takeaway
- Linear regression is a ruler laid across your data — the simplest possible model, capturing the idea of "multiply, add up, predict"
- The bias trick is a bookkeeping shortcut that bundles the line's starting point into the same math as the slope
- The one-shot formula works great for small problems but chokes on big ones — motivating the step-by-step method coming next
Loss Functions — MSE 🔗
Think of a test score, but flipped: zero means perfect and the bigger the number, the worse you did. That is exactly what a loss function does — it looks at the model's guesses and the correct answers, and produces a single "wrongness score." The most common version is called Mean Squared Error (MSE). It works by checking how far off each guess was, squaring those gaps (so a big miss counts way more than a small one), and averaging them all together. Lower is always better.
import numpy as np
def mse_loss(y_true, y_pred):
    """Mean Squared Error loss."""
    return np.mean((y_true - y_pred) ** 2)
# Example
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.8, 5.1, 7.3])
loss = mse_loss(y_true, y_pred) # 0.0467
Build It
This code calculates how wrong a set of predictions is (the loss) and which direction each guess should move to get closer to the right answer (the gradient).
import numpy as np
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.2, 6.5, 9.1])
# MSE: mean of squared residuals
mse = np.mean((y_pred - y_true) ** 2)
# Gradient: points toward improvement
d_mse = 2 * (y_pred - y_true) / len(y_true)
print(f"MSE: {mse:.4f}")
print(f"Gradient: {d_mse}")
Under the Hood
Squaring the errors is not just for show — it makes big misses hurt a lot more than small ones, which pushes the model to fix its worst guesses first.
The loss collapses all prediction errors into a single scalar. The gradient tells us which direction to adjust each prediction. Squaring serves two purposes: it's differentiable everywhere (unlike absolute value), and it penalizes large errors more than small ones. Cross-entropy (introduced in Section 8) is used instead when outputs are probabilities.
Key Takeaway
- A loss function is a "wrongness score" — like a test score where zero is perfect and bigger is worse
- MSE (Mean Squared Error) averages the squared gaps between guesses and answers — big misses get penalized heavily
- The gradient is a signpost that tells the model which way to adjust to shrink that wrongness score
Gradient Descent 🔗
Imagine you are blindfolded in the middle of a hilly field and you need to find the lowest valley. You cannot see, but you can feel the ground sloping under your feet. Each step, you figure out which direction goes downhill and take a small step that way. This process of feeling your way downhill is called gradient descent, and it is how nearly every AI model learns. The size of each step is controlled by a setting called the learning rate. Take steps that are too big and you leap right over the valley; too small and you inch along forever.
import numpy as np

def gradient_descent_step(w, b, x, y, lr=0.01):
    """One step of gradient descent for linear regression."""
    n = len(x)
    y_pred = w * x + b
    dw = (-2 / n) * np.sum(x * (y - y_pred))
    db = (-2 / n) * np.sum(y - y_pred)
    w -= lr * dw
    b -= lr * db
    return w, b
Build It
This code starts with a random guess and repeatedly nudges it downhill until it lands on the answer (3.0) — the blindfolded-hill-walking idea in five lines.
import numpy as np
# Gradient descent in 5 lines
w = np.random.randn() # random start
lr = 0.01 # learning rate
for step in range(100):
    grad = 2 * (w - 3.0)  # gradient of L = (w-3)^2
    w = w - lr * grad     # THE update rule
    print(f"Step {step}: w={w:.4f}, loss={(w-3)**2:.6f}")
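The too-big/too-small step-size tradeoff from the opening paragraph can be seen on this same toy loss. A sketch, with illustrative learning rates: since the update multiplies the error (w - 3) by (1 - 2*lr) each step, any lr above 1.0 makes the error grow instead of shrink.

```python
import numpy as np

def run(lr, steps=50, w0=0.0):
    """Run gradient descent on L = (w - 3)^2 from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * (w - 3.0)  # gradient of (w-3)^2 is 2(w-3)
    return w

w_small = run(lr=0.001)  # inches along: still far from 3 after 50 steps
w_good = run(lr=0.1)     # converges to ~3.0
w_big = run(lr=1.1)      # overshoots worse each step: diverges

print(w_small, w_good, w_big)
```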
Under the Hood
The actual "step downhill" is trivially cheap; almost all the work goes into figuring out which direction is downhill in the first place.
The weight update w -= lr * gradient is O(1) per parameter. ALL the computational cost is in computing the gradient. SGD (Stochastic Gradient Descent) estimates the gradient from a random mini-batch instead of the full dataset — noisier but much faster per step. Adam adds momentum and adaptive learning rates per parameter.
Key Takeaway
- Gradient descent is the blindfolded hill walk: feel which way slopes down, take a small step, repeat
- The learning rate is your step size — the single most important setting to get right
- Fancier versions — SGD (Stochastic Gradient Descent, which samples a small batch instead of all the data) and Adam (which builds up momentum like a rolling ball) — take shortcuts or build up speed, but the core idea never changes: step downhill
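The two variants named in the last bullet can be sketched in a few lines each. This is a toy setup (made-up data, illustrative hyperparameters), not a production optimizer: SGD estimates the gradient from a random mini-batch, and Adam maintains a running mean of gradients (momentum) plus a running mean of squared gradients (per-parameter scale).

```python
import numpy as np

# SGD: gradient estimated from a random mini-batch of 32 points
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 3.0 * X  # true weight is 3.0
w = 0.0
for _ in range(200):
    idx = rng.integers(0, len(X), size=32)    # random mini-batch
    xb, yb = X[idx], y[idx]
    grad = 2 * np.mean(xb * (w * xb - yb))    # gradient on the batch only
    w -= 0.05 * grad

# Adam: the update rule, on the toy loss L = (w - 3)^2
wa, m, v = 0.0, 0.0, 0.0
beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 201):
    g = 2 * (wa - 3.0)                     # gradient of (w-3)^2
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients
    v = beta2 * v + (1 - beta2) * g**2     # running mean of squared gradients
    m_hat = m / (1 - beta1**t)             # bias correction
    v_hat = v / (1 - beta2**t)
    wa -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w, wa)  # both approach the true value 3.0
```

Note how the core of both loops is still the same `w -= lr * gradient` step; only the gradient estimate (SGD) or the step scaling (Adam) changes.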
The Training Loop 🔗
Teaching a dog a new trick follows a predictable pattern: you show the trick, watch what the dog does wrong, figure out how to correct it, and adjust your approach — then repeat. Every AI model learns the same way, in a four-step cycle: (1) make a guess (the forward pass), (2) check how wrong the guess was (the loss), (3) figure out which knobs to turn and how far (the backward pass — computing gradients), and (4) actually turn those knobs (the update). One full trip through all the training examples is called an epoch, and the model gets a little better with each lap.
for epoch in range(num_epochs):
    for x_batch, y_batch in dataloader:
        # Forward pass
        y_pred = model(x_batch)
        loss = loss_fn(y_pred, y_batch)
        # Backward pass
        loss.backward()
        # Update parameters
        optimizer.step()
        optimizer.zero_grad()
Build It
This code runs the full four-step loop (guess, score, figure out correction, adjust) fifty times on 100 data points, and you can watch the loss drop as the model learns.
import numpy as np
X = np.random.randn(100, 1) # 100 samples, 1 feature
y = 2.5 * X + 1.0 + np.random.randn(100, 1) * 0.3
w, b = np.random.randn(), 0.0
lr = 0.01
for epoch in range(50):
    # 1. Forward
    y_pred = w * X + b
    # 2. Loss
    loss = np.mean((y_pred - y) ** 2)
    # 3. Backward (gradients)
    dw = 2 * np.mean(X * (y_pred - y))
    db = 2 * np.mean(y_pred - y)
    # 4. Update
    w -= lr * dw
    b -= lr * db
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: loss={loss:.4f}, w={w:.3f}, b={b:.3f}")
Under the Hood
The four-step loop looks almost identical whether you are training a tiny line-fitting model or a billion-parameter language model — only the size of the math changes.
Those four gradient lines (dw = X.T @ d_pred and friends) are backprop for a linear model. In mini-batch training, shapes go from (features,) to (batch_size, features) — every matmul is batched. Shape mismatches in the batch dimension are the #1 source of bugs in ML code.
Key Takeaway
- Every AI model trains the same way, like teaching a dog: show the trick, see what went wrong, figure out the correction, adjust — repeat
- This four-step loop (guess, score, figure out correction, adjust) is the heartbeat of all deep learning
- Processing examples in small batches (mini-batching) lets the computer chew through data much faster by working on many examples at once
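The mini-batching mentioned in the last bullet can be grafted straight onto the four-step loop. A sketch with an illustrative batch size of 16: each epoch shuffles the data, then runs the guess-score-correct-adjust cycle once per small batch instead of once per full pass.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.5 * X + 1.0 + rng.normal(size=(100, 1)) * 0.3
w, b, lr, batch_size = 0.0, 0.0, 0.05, 16

for epoch in range(100):
    perm = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]  # one mini-batch
        xb, yb = X[idx], y[idx]
        y_pred = w * xb + b                   # 1. forward
        dw = 2 * np.mean(xb * (y_pred - yb))  # 3. backward (on this batch)
        db = 2 * np.mean(y_pred - yb)
        w -= lr * dw                          # 4. update
        b -= lr * db

print(f"w={w:.2f}, b={b:.2f}")  # close to the true 2.5 and 1.0
```

Each update now touches 16 examples instead of 100, so the model takes many cheap noisy steps per epoch rather than one expensive exact one.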
Single Neuron 🔗
Picture a tiny voting machine: several people each cast a vote with different levels of enthusiasm (some shout, some whisper), the machine adds up all those weighted votes, and then makes a yes-or-no decision based on the total. A single neuron works the same way — it takes several input numbers, multiplies each one by a weight (how much that input matters), adds them all up with a nudge value (the bias), and then passes the total through a special gate called an activation function. That gate is the twist: without it, a neuron could only draw straight lines. With it, it can learn curves. The order matters — you add up first, then pass through the gate, not the other way around.
import numpy as np
def neuron(x, w, b):
    """Single neuron with sigmoid activation."""
    z = np.dot(w, x) + b
    return 1 / (1 + np.exp(-z))
x = np.array([0.5, 0.3, 0.2])
w = np.array([0.4, -0.1, 0.8])
b = 0.1
output = neuron(x, w, b) # ~0.60
Build It
This code builds one tiny voting machine: it multiplies three inputs by three weights, adds them up, and squishes the result through a gate that outputs a number between 0 and 1.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# A single neuron
x = np.array([0.5, -0.3, 1.2]) # 3 inputs
w = np.array([0.8, -0.5, 1.0]) # 3 weights
b = 0.1 # bias
z = np.dot(w, x) + b # weighted sum: 1.85
a = sigmoid(z) # activation: 0.86
# Sigmoid derivative (computed from output alone!)
da_dz = a * (1 - a) # 0.117
print(f"z={z:.2f}, a={a:.2f}, da/dz={da_dz:.3f}")
Under the Hood
A single neuron does surprisingly little math — one multiplication-and-add, then one squeeze through a gate — but stacking thousands of them creates the power behind modern AI.
A neuron computes a dot product O(d) plus one nonlinear function O(1). The sigmoid derivative a*(1-a) can be computed from the output alone — no need to store z. Note: the biological neuron analogy is extremely loose — real neurons use spike timing, not continuous values.
Key Takeaway
- A neuron is a tiny voting machine — it weighs its inputs, adds them up, and passes the total through a gate to make a decision
- The gate (activation function) is the secret ingredient that lets the model learn curves instead of only straight lines
- The sigmoid gate squishes any number, no matter how huge or negative, into a tidy range between 0 and 1 — handy for yes/no answers
Activation Functions + Softmax + Cross-Entropy 🔗
Imagine a series of gates in a water pipe, each deciding how much water to let through. That is what activation functions do inside a neural network — they control which signals pass and how strongly. The original gate, the sigmoid from the last section, has a problem: it lets less and less water through the deeper the pipe goes, until the flow is practically zero. This fading signal is called the vanishing gradient problem. A newer gate called ReLU (Rectified Linear Unit) fixes this with a dead-simple rule: let positive signals through untouched, block everything negative. Modern AI models use even smoother versions called GELU (Gaussian Error Linear Unit) and SwiGLU (Swish-Gated Linear Unit) — think of them as smarter valves that let a tiny trickle through even for slightly negative values. There is one more tool in this section: softmax, which takes a list of raw scores and converts them into percentages that add up to 100%, and cross-entropy, a loss function that measures how far off those percentages are from reality.
import numpy as np
def relu(z):
    return np.maximum(0, z)

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # numerical stability
    return exp_z / exp_z.sum()

def cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred + 1e-9))
Build It
This code defines several valve types (ReLU, sigmoid, GELU), converts raw scores into percentages with softmax, and measures how wrong those percentages are with cross-entropy loss.
import numpy as np
# Activation functions as one-liners
relu = lambda x: np.maximum(0, x)
sigmoid = lambda x: 1 / (1 + np.exp(-x))
gelu = lambda x: x * 0.5 * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))
# Softmax with numerical stability
def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max to prevent overflow
    return e / e.sum()
# Cross-entropy loss
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits) # [0.659, 0.242, 0.099]
target = 0 # true class index
loss = -np.log(probs[target]) # 0.417
print(f"Probs: {probs}, Loss: {loss:.3f}")
Under the Hood
The sigmoid valve chokes the learning signal by 75% at every layer — stack ten layers and you have practically zero signal left, which is why deeper networks needed a better valve.
Vanishing gradient: sigmoid's max derivative is 0.25. After 10 layers: 0.25^10 = 9.5×10⁻⁷ — gradients effectively disappear. ReLU's derivative is exactly 1 for positive inputs, so gradients flow undiminished. GELU (x·Φ(x)) is smoother than ReLU and used in GPT/BERT. Modern LLMs (LLaMA, Mistral) use SwiGLU: (xW₁ · swish(xW₃)) @ W₂ — three weight matrices instead of two.
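The 0.25^10 arithmetic above is easy to verify. A small sketch: the gradient reaching layer 1 of a deep network is the product of one derivative factor per layer, and for sigmoid each factor is at most 0.25.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(size=10)                      # one preactivation per layer
sig_factors = sigmoid(z) * (1 - sigmoid(z))  # each factor is at most 0.25

print(np.prod(sig_factors))  # the actual product: even tinier than the bound
print(0.25 ** 10)            # best-case bound: ~9.5e-07
```

Since ReLU's derivative is exactly 1 for positive inputs, the corresponding product for a ReLU path through active neurons stays at 1, which is the whole fix.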
Key Takeaway
- ReLU is a one-way valve that lets positive signals through at full strength, solving the fading-signal problem that plagued earlier valves like sigmoid
- Softmax turns a list of raw scores into percentages that add up to 100% — so the model can say "I am 80% sure it is a cat"
- Cross-entropy is a "wrongness score" for those percentages — the further from reality, the higher the penalty
- Today's most powerful AI models (like GPT and LLaMA) use smoother, more advanced valves called GELU and SwiGLU
Neural Networks
You have seen how a single tiny voting machine (neuron) makes decisions. Now it is time to build a whole team out of them. You will see how groups of neurons draw increasingly clever boundaries between categories, how the network figures out which teammates to blame when it gets an answer wrong, and how to stop it from just memorizing the answers instead of actually learning.
Decision Boundaries 🔗
Imagine drawing lines on a map to separate two neighborhoods. One straight line can only split the map in two, which works for simple cases but fails when the neighborhoods are interleaved. A single neuron (our tiny voting machine from Chapter 1) can only draw one straight cut like that. But give it a few teammates, and the group can draw multiple lines that combine into curved, complex borders — separating even the trickiest layouts. These borders are called decision boundaries. Watch below how adding more neurons transforms a single straight cut into flexible curves that can separate any arrangement of data points.
# Visualize decision boundary of a trained model
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(model, X, y, resolution=100):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, resolution),
        np.linspace(y_min, y_max, resolution)
    )
    grid = np.c_[xx.ravel(), yy.ravel()]
    preds = model.predict(grid).reshape(xx.shape)
    # Plot contour of predictions, then the data points on top
    plt.contourf(xx, yy, preds, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.show()
Build It
This code builds a small two-layer network (untrained, with random weights) and runs it forward on the XOR problem — a classic pattern that a single straight line cannot separate — then maps out the decision boundary across a grid.
import numpy as np
# XOR: not linearly separable
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y = np.array([0, 1, 1, 0])
# 2-layer network: 2 → 4 → 1
np.random.seed(42)
W1 = np.random.randn(2, 4) * 0.5
b1 = np.zeros(4)
W2 = np.random.randn(4, 1) * 0.5
b2 = np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
# Forward pass
h = sigmoid(X @ W1 + b1) # hidden layer
out = sigmoid(h @ W2 + b2) # output
# Evaluate on a grid for the decision boundary
xx, yy = np.mgrid[0:1:0.01, 0:1:0.01]
grid = np.c_[xx.ravel(), yy.ravel()]
preds = sigmoid(sigmoid(grid @ W1 + b1) @ W2 + b2).reshape(xx.shape)
Under the Hood
Think of each neuron as drawing one straight line on the map, then the activation functions bend and combine those lines into curves. Each neuron contributes one linear boundary. Combined through nonlinear activations, they form arbitrarily complex regions. The grid evaluation is brute force — forward pass at every pixel. For a 100×100 grid with a network of width w: O(100² × w²) operations per frame.
Key Takeaway
- One neuron can only draw one straight line on the map — but a group of neurons working together can carve out any shape you need
- Some patterns (like XOR) are impossible to separate with a single straight cut — you need at least two neurons teaming up
- As the network learns (adjusts its weights during training), you can watch these boundary lines shift and reshape in real time
Neural Networks & Layers 🔗
Picture a relay race where each runner takes the baton, does one simple thing to it, and passes it on. That is essentially how a neural network works — a team of specialists arranged in a chain, where each person does one simple job and passes the result to the next. The first person looks at the raw input, picks out a few things, and hands a summary to the second person, who refines it further, and so on until the last person delivers the final answer. Each “person” in this chain is a layer — it takes in numbers, multiplies them by a set of weights (how much attention to pay to each input), adds a nudge (called a bias), and runs the result through an activation function (the one-way valve from Chapter 1). Stacking these layers one after another is called a forward pass: input flows in one end and a prediction comes out the other.
import numpy as np
class DenseLayer:
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_out, n_in) * 0.01
        self.b = np.zeros((n_out, 1))

    def forward(self, x):
        self.x = x
        z = self.W @ x + self.b
        return np.maximum(0, z)  # ReLU
Build It
This code creates a full neural network class with multiple layers, runs a forward pass (input in, prediction out), and counts the total number of adjustable settings (parameters) the network has.
import numpy as np
class NeuralNetwork:
    def __init__(self, layers):
        # He initialization: scale by sqrt(2/fan_in)
        self.params = []
        for i in range(len(layers) - 1):
            W = np.random.randn(layers[i], layers[i+1]) * np.sqrt(2 / layers[i])
            b = np.zeros(layers[i+1])
            self.params.append((W, b))

    def forward(self, X):
        self.activations = [X]  # cache for backprop
        for i, (W, b) in enumerate(self.params):
            z = X @ W + b
            X = np.maximum(0, z) if i < len(self.params) - 1 else 1 / (1 + np.exp(-z))
            self.activations.append(X)
        return X

net = NeuralNetwork([3, 4, 4, 2])  # 3→4→4→2
out = net.forward(np.random.randn(1, 3))
# Param count: (3×4 + 4) + (4×4 + 4) + (4×2 + 2) = 46
Under the Hood
Under the surface, a neural network is just a list of number grids (weight matrices) with a nudge value (bias) for each layer. A network = `list[(W, b)]`. He initialization scales weights by √(2/fan_in) to prevent activations from exploding or collapsing to zero. Parameter count: Σ(layer_i × layer_{i+1} + layer_{i+1}). Caching activations during forward pass is essential — backprop needs them to compute gradients.
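The explode-or-collapse claim about initialization can be demonstrated numerically. A sketch with an illustrative width of 256: push the same input through 10 ReLU layers under He initialization versus a naive tiny-weights initialization, and compare how big the activations stay.

```python
import numpy as np

rng = np.random.default_rng(0)
x_he = x_small = rng.normal(size=(1, 256))

for _ in range(10):
    W_he = rng.normal(size=(256, 256)) * np.sqrt(2 / 256)  # He init
    W_small = rng.normal(size=(256, 256)) * 0.01           # naive tiny init
    x_he = np.maximum(0, x_he @ W_he)
    x_small = np.maximum(0, x_small @ W_small)

print(x_he.std())     # stays order 1: signal preserved through all 10 layers
print(x_small.std())  # collapses toward 0: the signal has effectively died
```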
Key Takeaway
- A neural network is just a chain of specialists — each layer does a simple calculation and passes the result forward, like a relay race
- The starting values of the weights matter: if they are too big or too small, the signal either explodes or fades to nothing before it reaches the end (a technique called He initialization picks smart starting values)
- The entire “thinking” process is the forward pass — data flows through every layer in order, and out comes a prediction
Backpropagation 🔗
Imagine a teacher grading a group project and tracing blame backward: “the conclusion was weak because the analysis was wrong, which happened because the data was misread.” That is exactly how a neural network learns from its mistakes. After the network makes a prediction (the forward pass), it checks how wrong the answer was (the loss score from Chapter 1). Then it works backward through every layer, asking: “How much did you contribute to this mistake?” Each layer gets a blame score that tells it how to adjust its weights. This backward blame-tracing process is called backpropagation (or “backprop” for short). It is the engine that makes the training loop from Chapter 1 actually work for networks with many layers.
class Value:
    """Tiny autograd engine (inspired by micrograd)."""
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = set(_children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
Build It
This code runs a two-layer network forward to get a prediction, then traces blame backward through every layer to figure out how each weight should change.
import numpy as np
sigmoid = lambda z: 1 / (1 + np.exp(-z))
# 2-layer network: forward then backward
X = np.array([[0.5, -0.3]]) # (1, 2)
y = np.array([[0.8]]) # target
W1 = np.random.randn(2, 3) * 0.5
b1 = np.zeros(3)
W2 = np.random.randn(3, 1) * 0.5
b2 = np.zeros(1)
# Forward (cache z and a at each layer)
z1 = X @ W1 + b1; a1 = sigmoid(z1)
z2 = a1 @ W2 + b2; a2 = sigmoid(z2)
loss = np.mean((a2 - y) ** 2)
# Backward: three formulas per layer
dL_da2 = 2 * (a2 - y)
dL_dz2 = dL_da2 * a2 * (1 - a2) # through activation
dL_dW2 = a1.T @ dL_dz2 # outer product of activations and deltas, NOT elementwise
dL_db2 = dL_dz2.sum(axis=0)
dL_da1 = dL_dz2 @ W2.T # propagate backward
dL_dz1 = dL_da1 * a1 * (1 - a1)
dL_dW1 = X.T @ dL_dz1
dL_db1 = dL_dz1.sum(axis=0)
Under the Hood
The backward pass is essentially the teacher walking from the final answer back to the first step, handing out blame at each stop. Backprop costs roughly the same as the forward pass. Memory requirement: 2× inference because you must store all z and a from the forward pass. Common bugs: (1) transposing the wrong matrix in the weight gradient, (2) forgetting to cache activations, (3) not zeroing gradients between iterations. The weight gradient `dL/dW = a_prev.T @ dL/dz` is an outer product — this is the most commonly confused operation in backprop.
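The standard way to catch the bugs listed above is a finite-difference gradient check. This sketch reuses the same tiny 2-layer setup: compute the analytic gradient for W2 with the backprop formulas, then compare each entry against (loss(W2 + eps) - loss(W2 - eps)) / (2 * eps).

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
rng = np.random.default_rng(0)
X = np.array([[0.5, -0.3]])
y = np.array([[0.8]])
W1 = rng.normal(size=(2, 3)) * 0.5; b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1)) * 0.5; b2 = np.zeros(1)

def loss_fn(W2_):
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2_ + b2)
    return np.mean((a2 - y) ** 2)

# Analytic gradient: same formulas as the backward pass above
a1 = sigmoid(X @ W1 + b1)
a2 = sigmoid(a1 @ W2 + b2)
dL_dz2 = 2 * (a2 - y) * a2 * (1 - a2)
dL_dW2 = a1.T @ dL_dz2

# Numerical gradient: nudge one weight at a time
eps = 1e-6
num = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss_fn(Wp) - loss_fn(Wm)) / (2 * eps)

print(np.max(np.abs(num - dL_dW2)))  # should be near zero
```

If the two gradients disagree beyond round-off, one of the backprop formulas (usually a transposed matrix) is wrong.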
Key Takeaway
- Backpropagation is like a teacher grading a group project: it traces blame backward from the final answer through every layer to find out who made the mistake
- Each layer gets a “blame score” (called a gradient) that tells it exactly how to adjust its weights — the same hill-feeling process from Chapter 1, but now applied layer by layer
- Training uses roughly twice the memory of just making predictions, because the network has to remember its work from the forward pass so the backward pass can assign blame
Overfitting & Regularization 🔗
Think of a student who memorizes the answer key word-for-word instead of actually learning the subject — they ace every practice test but bomb the real exam because the questions are slightly different. Neural networks can do the same thing: instead of learning general patterns, they memorize the specific training examples, including the random noise and quirks. This is called overfitting. The warning sign is easy to spot: the network's error on its practice data (training loss) keeps going down, but its error on new, unseen data (validation loss) starts going up. To prevent this, we use tricks called regularization. One approach (called L2 regularization or weight decay) penalizes the network for having large weights, nudging it toward simpler solutions. Another (called dropout) randomly turns off some neurons during each training step, forcing the network to not rely too heavily on any single neuron.
import numpy as np
def l2_regularized_loss(y_true, y_pred, weights, lambda_=0.01):
    """MSE loss with L2 (weight decay) regularization."""
    mse = np.mean((y_true - y_pred) ** 2)
    l2_penalty = lambda_ * sum(np.sum(w**2) for w in weights)
    return mse + l2_penalty

def dropout(h, p=0.5, training=True):
    """Randomly zero out neurons during training."""
    if not training:
        return h
    mask = np.random.binomial(1, 1 - p, size=h.shape) / (1 - p)
    return h * mask
Build It
This code shows two ways to keep a network from memorizing: L2 regularization (penalizing large weights) and dropout (randomly silencing neurons during training).
import numpy as np
# Ridge regression: L2 regularization
X = np.random.randn(50, 5)
y = X @ np.array([1, 2, 0, -1, 0.5]) + np.random.randn(50) * 0.3
lam = 1.0 # regularization strength
# Closed-form: w = (X^T X + λI)^{-1} X^T y
# NOTE: do NOT penalize the bias column
XtX = X.T @ X
w_ridge = np.linalg.solve(XtX + lam * np.eye(5), X.T @ y)
# Dropout (training time)
def dropout(a, p=0.5):
    mask = np.random.binomial(1, 1 - p, size=a.shape)
    return a * mask / (1 - p)  # scale to maintain expected value
Under the Hood
Both techniques work by making the network keep things simple — small weights mean gentler, smoother predictions rather than wild, spiky ones. L2 regularization adds λ||w||² to the loss, pushing weights toward zero. The term +λI also makes X^TX invertible (useful when features > samples). Dropout during training creates a different ‘sub-network’ each step — at test time, it’s equivalent to an ensemble average. The 1/(1-p) scaling ensures activations have the same expected value during training and inference.
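The 1/(1-p) scaling claim can be checked empirically. A sketch: average inverted dropout over many random masks and confirm each activation's expected value comes back unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 3.0])
p = 0.5
n = 100_000

# n independent dropout masks, one per row
masks = rng.binomial(1, 1 - p, size=(n, a.size))
est = (a * masks / (1 - p)).mean(axis=0)  # inverted dropout, averaged

print(est)  # close to the original [1.0, 2.0, 3.0]
```

Without the 1/(1-p) factor the average would shrink to (1-p) times the original, and the network would see systematically smaller signals at test time than during training.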
Key Takeaway
- Overfitting is like memorizing the answer key — great on practice tests, terrible on the real exam. The warning sign is when validation loss starts climbing while training loss keeps falling.
- L2 regularization is like a “keep it simple” rule — it penalizes the network for using big weights, nudging it toward smoother, more general solutions
- Dropout randomly turns off neurons during training, like forcing a team to practice without their star player — everyone else has to step up, making the whole team more resilient
Residual Connections & Normalization 🔗
Imagine a long game of telephone: by the time a message passes through 50 people, it is completely garbled. Deep neural networks have the same problem — information and learning signals get weaker with every layer they pass through. Surprisingly, a 50-layer network can perform worse than a 20-layer one, even on data it has already seen (so it is not just memorizing poorly). This is called the degradation problem. The fix is beautifully simple: give each person in the telephone game a written copy of the original message alongside the whispered one. In network terms, you add a shortcut that lets the original input skip over a layer and get added directly to that layer’s output. This shortcut is called a residual connection (or skip connection). It creates a highway for information and learning signals to flow through, even in networks with hundreds of layers. Alongside residual connections, modern networks also use normalization — a step that keeps the numbers flowing through the network in a reasonable range, like an editor making sure each person’s notes are the same font size before passing them along.
import numpy as np
def residual_block(x, W1, b1, W2, b2):
    """A simple residual block: x + F(x)."""
    h = np.maximum(0, W1 @ x + b1)  # ReLU
    out = W2 @ h + b2
    return x + out  # skip connection

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization."""
    mu = np.mean(x, axis=-1, keepdims=True)
    sigma = np.std(x, axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta
Build It
This code builds a residual block (where the input gets added back to the output) and a normalization step, the two ingredients that let modern networks grow to hundreds of layers deep.
import numpy as np
# Residual block: the +x is the entire innovation
def residual_block(x, W1, W2):
z = np.maximum(0, x @ W1) # ReLU(x @ W1)
F_x = z @ W2 # second linear
return np.maximum(0, F_x + x) # F(x) + x, then ReLU
# RMSNorm (used in LLaMA/Mistral — faster than LayerNorm)
def rms_norm(x, eps=1e-6):
return x / np.sqrt(np.mean(x ** 2) + eps)
# Pre-LN transformer block (modern standard)
# x_out = x + Attention(rms_norm(x))
# x_out = x_out + FFN(rms_norm(x_out))
Under the Hood
The “written copy of the original message” means the learning signal always has a clean path through the network, even if some layers are struggling. The gradient through a residual block is dy/dx = dF/dx + I. Even if dF/dx vanishes, the identity term I keeps gradient magnitude at 1. Zero extra parameters. The key motivation: the degradation problem — a 56-layer plain network had higher training error than a 20-layer one. Pre-LN (Pre-Layer Normalization — normalizing before each sublayer instead of after) is the modern standard used in GPT-2+, Claude, and LLaMA. RMSNorm (Root Mean Square Normalization) skips mean-centering for ~15% speedup.
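The dy/dx = dF/dx + I claim can be checked with a quick finite-difference experiment: give F nearly dead weights (so dF/dx ≈ 0) and compare the gradient with and without the skip connection. Dimensions and weight scales below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1 = rng.normal(size=(d, d)) * 1e-4   # a nearly "dead" layer:
W2 = rng.normal(size=(d, d)) * 1e-4   # its Jacobian dF/dx is almost zero

def F(x):                             # the layer's transformation
    return np.maximum(0, x @ W1) @ W2

x = rng.normal(size=d)
eps = 1e-6
e0 = np.zeros(d); e0[0] = eps

# finite-difference gradient of sum(output) with respect to x[0]
g_plain = (F(x + e0).sum() - F(x).sum()) / eps                    # ≈ 0: signal dies
g_res = ((x + e0 + F(x + e0)).sum() - (x + F(x)).sum()) / eps     # ≈ 1: identity survives
print(round(g_plain, 3), round(g_res, 3))
```

Even with a useless layer, the residual path delivers a gradient of 1 — the written copy of the message arrives intact.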
Key Takeaway
- Residual connections are like giving each person in a telephone game a written copy of the original message — even if the whispered version gets garbled, the written one keeps the information intact
- Normalization keeps the numbers flowing through the network in a tidy range, like an editor standardizing everyone's handwriting so the next person can read it clearly
- These two tricks together are what let modern AI models stack hundreds of layers deep without the signal falling apart
Supplement Section 14: CNNs & RNNs
CNNs & RNNs 🔗
Imagine scanning a photo with a magnifying glass, inch by inch, looking for familiar patterns. That is essentially how one older AI design works. A CNN (Convolutional Neural Network) works like a magnifying glass sliding across an image — it looks at one small patch at a time, checking for patterns like edges, curves, or textures, then slides over to the next patch. By stacking layers, it builds up from simple edges to complex features like faces or cars. An RNN (Recurrent Neural Network) works like reading a book one word at a time while keeping a mental summary of everything you have read so far — each new word updates that running summary, so the network can handle sequences like sentences or time series. Both designs have been largely replaced by a newer architecture called the transformer (which you will meet in Chapter 4), but the ideas behind them still show up everywhere.
import numpy as np
def conv1d(x, kernel):
"""Simple 1D convolution (no padding)."""
k_len = len(kernel)
out_len = len(x) - k_len + 1
return np.array([np.dot(x[i:i+k_len], kernel) for i in range(out_len)])
def rnn_step(x_t, h_prev, W_h, W_x, b):
"""Single RNN step."""
return np.tanh(W_h @ h_prev + W_x @ x_t + b)
Build It
This code shows a CNN filter sliding across a signal to detect edges, and a single step of an RNN updating its running summary with new input.
import numpy as np
# 1D Convolution: a filter slides across input
signal = np.array([1, 0, 2, 3, 1, 0, 1])
kernel = np.array([1, 0, -1]) # edge detector
# NB: np.convolve flips the kernel (true convolution); ML libraries actually
# compute cross-correlation, i.e. np.correlate(signal, kernel, 'valid')
conv = np.convolve(signal, kernel, mode='valid')
# conv = [1, 3, -1, -3, 0]
# Simple RNN step
def rnn_step(x_t, h_prev, W_xh, W_hh, b):
h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b)
return h_t
Under the Hood
The key efficiency trick: the magnifying glass (CNN filter) uses the same lens everywhere it looks, and the reader (RNN) uses the same “how to update my summary” rule at every step. CNNs share weights across spatial positions (a 3×3 filter has only 9 params regardless of image size). RNNs share weights across time steps but suffer from vanishing gradients over long sequences — LSTMs (Long Short-Term Memory networks) added gates to control information flow. Transformers replaced both by processing all positions in parallel via attention.
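The vanishing-gradient problem can be seen directly: the Jacobian of one step h_t = tanh(W_hh h_{t-1}) is diag(1 − h_t²) · W_hh, and multiplying fifty of these together drives the gradient toward zero when the recurrent weights are small. A toy demonstration (the size and weight scale are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_hh = rng.normal(size=(d, d)) * 0.1    # small recurrent weights
h = rng.normal(size=d)
J = np.eye(d)                           # accumulated Jacobian dh_t/dh_0

for t in range(50):
    h = np.tanh(W_hh @ h)
    J = np.diag(1 - h ** 2) @ W_hh @ J  # chain rule, one step at a time

print(np.linalg.norm(J))                # tiny — early inputs barely influence step 50
```

This is the "early memories fade" effect from the takeaway below: by step 50, the gradient reaching the first input is vanishingly small, which is what LSTM gates — and later, attention — were designed to fix.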
Key Takeaway
- A CNN is like a magnifying glass that slides across an image, checking each small patch for patterns like edges and textures — it uses the same lens everywhere, so it needs very few settings to learn
- An RNN reads a sequence one step at a time while keeping a running summary of what came before — but for very long sequences, the early memories tend to fade (a problem called vanishing gradients)
- Transformers (coming in Chapter 4) replaced both by looking at all parts of the input at once, which is faster and handles long-range patterns better
Representation
Computers only understand numbers, not words. Before an AI can read a sentence, every word has to be translated into a list of numbers — and the sentence itself has to be chopped into bite-sized pieces. This chapter shows you how that translation and chopping happen.
~8 min · Embeddings 🔗
Imagine a map where cities are placed by similarity — Paris near Rome because both are European capitals, Tokyo near Seoul because both are Asian capitals. Embeddings do exactly this for words: they place each word on a “map” made of numbers, so that related words end up close together. The simplest approach — giving each word its own switch in a giant row of off-switches, then flipping just one on (called “one-hot encoding”) — wastes space and tells you nothing about which words are related. Embeddings fix this by giving every word a compact list of numbers (a “dense vector”) where similar words get similar numbers. Under the hood, it is just a table lookup — embedding = matrix[token_id] — but the table is learned through training (backprop, the “trace blame backward” process from Chapter 2).
import numpy as np
# Simple embedding lookup table
vocab_size, embed_dim = 1000, 64
E = np.random.randn(vocab_size, embed_dim) * 0.01
def embed(token_id):
return E[token_id]
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Build It
This code creates a word-to-numbers lookup table and a function that checks how similar two words are by comparing their number lists.
import numpy as np
# Embedding is just array indexing
vocab_size, d_model = 50000, 768
embedding_matrix = np.random.randn(vocab_size, d_model) * 0.02
token_id = 1234
embedding = embedding_matrix[token_id] # (768,) — that's it!
# Cosine similarity: are two words related?
def cosine_sim(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Under the Hood
Looking up an embedding is as cheap as copying a row from a spreadsheet — no heavy math at all. Embedding lookup is O(d) — just copying d numbers. No multiplication. GPT-2's embedding matrix: 50,257 × 768 × 4 bytes = ~148MB. Embeddings are learned through backprop: gradients only flow to the rows that were looked up (sparse updates). The 'king - man + woman ≈ queen' arithmetic works because the model learns consistent vector offsets for semantic relationships.
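The famous vector arithmetic can be demonstrated with tiny hand-crafted vectors. These three made-up dimensions (royalty, male, female) stand in for the hundreds a real model learns on its own:

```python
import numpy as np

# Hand-crafted toy embeddings — illustrative, not learned
# dimensions: [royalty, male, female]
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max(vecs, key=lambda w: cos(vecs[w], target))
print(best)  # → queen
```

Subtracting "man" removes the male offset, adding "woman" adds the female one, and the nearest remaining vector is "queen" — the same consistent-offset effect real models exhibit.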
Key Takeaway
- An embedding is like a GPS coordinate for a word — it turns a word into a list of numbers that capture its meaning.
- Words with similar meanings land near each other on the “map,” just like Paris and Rome sit close on a real map.
- The whole process is just looking up a row in a table, and the table improves as the model trains.
Tokenization 🔗
Like a child learning to read by sounding out syllables — “un-happi-ness” — AI models break text into small, manageable pieces before they can process it. These pieces are called “tokens.” Splitting letter by letter makes sentences painfully long. Splitting by whole words breaks when the model meets a word it has never seen. A technique called BPE (Byte Pair Encoding — a method that builds a vocabulary by repeatedly gluing together the most common neighboring pieces) finds the sweet spot: it starts with individual letters, then keeps merging the pairs that appear together most often. After about 50,000 merges you get a vocabulary that can handle any text efficiently.
def simple_bpe_step(corpus, num_merges=10):
    """Simplified BPE: repeatedly merge the most frequent pair."""
    # Start with character-level tokens plus an end-of-word marker
    tokens = [list(word) + ['</w>'] for word in corpus]
    for _ in range(num_merges):
        # Count all adjacent pairs across every word
        pairs = {}
        for word_tokens in tokens:
            for i in range(len(word_tokens) - 1):
                pair = (word_tokens[i], word_tokens[i+1])
                pairs[pair] = pairs.get(pair, 0) + 1
        if not pairs:
            break
        # Merge every occurrence of the most frequent pair
        best = max(pairs, key=pairs.get)
        merged = []
        for word_tokens in tokens:
            new_word, i = [], 0
            while i < len(word_tokens):
                if i < len(word_tokens) - 1 and (word_tokens[i], word_tokens[i+1]) == best:
                    new_word.append(word_tokens[i] + word_tokens[i+1])
                    i += 2
                else:
                    new_word.append(word_tokens[i])
                    i += 1
            merged.append(new_word)
        tokens = merged
    return tokens
Build It
This code starts with individual letters and repeatedly merges the most common pair, showing you each merge step — just like BPE builds its vocabulary.
# Simplified BPE implementation
def bpe(text, num_merges=10):
tokens = list(text) # start with characters
for i in range(num_merges):
# Count all adjacent pairs
pairs = {}
for j in range(len(tokens) - 1):
pair = (tokens[j], tokens[j+1])
pairs[pair] = pairs.get(pair, 0) + 1
if not pairs:
break
# Merge most frequent pair
best = max(pairs, key=pairs.get)
new_tokens = []
j = 0
while j < len(tokens):
if j < len(tokens)-1 and (tokens[j], tokens[j+1]) == best:
new_tokens.append(tokens[j] + tokens[j+1])
j += 2
else:
new_tokens.append(tokens[j])
j += 1
tokens = new_tokens
print(f"Merge {i+1}: '{best[0]}'+'{best[1]}' → '{best[0]+best[1]}'")
return tokens
Under the Hood
Think of BPE as a zip file for language — it finds repeating patterns and squishes them together to save space. BPE is greedy compression. GPT-2 uses ~50K merges starting from bytes (byte-level BPE). Vocab size tradeoffs: larger vocab = more embedding params but fewer tokens per sequence. English averages ~1.3 tokens/word; CJK languages use 2-3× more tokens per word — this means shorter effective context windows for non-English text.
Key Takeaway
- BPE builds a dictionary by gluing together letter-pairs that often appear side by side — like noticing “th” shows up everywhere in English and making it one piece.
- The size of this dictionary is a trade-off: a bigger dictionary means the model needs more memory, and some languages get shortchanged with fewer entries.
- This chopping-up step happens before the AI ever sees the text — it is the very first stage of the pipeline.
Attention & Transformers
This is where the magic happens. You will learn how an AI decides which words in a sentence matter most to each other, how those decisions are stacked into a powerful assembly-line architecture, and how that architecture powers the chatbots and writing tools you use every day.
~12 min · Attention & Multi-Head 🔗
You are at a crowded party and someone across the room says your name — your brain instantly zeros in on that voice and tunes out the noise. AI attention works the same way: each word “listens” to every other word in the sentence and decides which ones matter most right now. To do this, every word creates three things: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what information do I carry?”). The model figures out how much two words should pay attention to each other by comparing their Query and Key (using a dot product, which is just a way of measuring similarity). Then it uses softmax (the “pick-a-winner” function from Chapter 1) to turn those raw scores into percentages that add up to 100%. This is the single most important equation in modern AI.
import numpy as np
def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores) # along the last axis
    return weights @ V
def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
"""Multi-head attention (simplified)."""
d = x.shape[-1]
head_dim = d // n_heads
heads = []
for h in range(n_heads):
Q = x @ W_q[h]
K = x @ W_k[h]
V = x @ W_v[h]
heads.append(scaled_dot_product_attention(Q, K, V))
return np.concatenate(heads, axis=-1) @ W_o
Build It
This code computes attention scores between words and then uses multiple “attention heads” so the model can focus on different types of relationships at the same time.
import numpy as np
def attention(Q, K, V):
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k) # (seq, seq)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # softmax
return weights @ V # (seq, d_v)
# Multi-head: split, attend, concatenate
def multi_head(X, n_heads, Wq, Wk, Wv, Wo):
Q, K, V = X @ Wq, X @ Wk, X @ Wv
d_k = Q.shape[-1] // n_heads
heads = []
for i in range(n_heads):
qi = Q[:, i*d_k:(i+1)*d_k]
ki = K[:, i*d_k:(i+1)*d_k]
vi = V[:, i*d_k:(i+1)*d_k]
heads.append(attention(qi, ki, vi))
return np.concatenate(heads, axis=-1) @ Wo
Under the Hood
The big cost of attention is that every word has to check in with every other word, so doubling the sentence length quadruples the work. Q@K^T creates a (seq_len × seq_len) attention matrix — this is why attention is O(seq² × d). The √d_k scaling prevents dot products from growing too large (when Q,K entries are iid with variance 1, the dot product has variance d_k — dividing by √d_k keeps variance at 1, preventing softmax saturation). Multi-head attention lets the model attend to different types of relationships simultaneously.
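The √d_k variance argument can be checked empirically: dot products of random d_k-dimensional vectors have variance ≈ d_k, and dividing by √d_k brings it back to ≈ 1. The sample sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10_000, d_k))    # 10,000 random query vectors
k = rng.normal(size=(10_000, d_k))    # 10,000 random key vectors

dots = (q * k).sum(axis=1)            # raw dot products
print(dots.var())                     # ≈ 64 — grows linearly with d_k
print((dots / np.sqrt(d_k)).var())    # ≈ 1.0 after scaling
```

Without the scaling, scores of magnitude ~√64 = 8 would push softmax into its saturated region, where gradients are nearly zero — so the division is what keeps attention trainable at large d_k.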
Key Takeaway
- Attention is the AI asking “which other words should I pay attention to right now?” — like your brain picking out your name in a noisy room.
- Because every word checks every other word, the work grows rapidly with longer text — this is why chatbots have a limit on how much text they can handle at once.
The Transformer Architecture 🔗
Picture a factory assembly line where each station has two jobs: first, the workers discuss which parts of the project matter most (that is attention); then each worker refines their own piece independently (that is the feed-forward network, or FFN). After each job, every worker keeps a photocopy of what they had before so nothing gets lost (that is the residual connection — the “telephone game with written copies” from Chapter 2). A transformer is just a stack of these identical stations, one after another. The FFN holds roughly two-thirds of all the model's learned information — think of it as the factory's filing cabinet of knowledge. Modern transformers also tidy up the numbers before each step (called “normalization”) to keep things stable.
class TransformerBlock:
def __init__(self, d_model, n_heads):
self.attention = MultiHeadAttention(d_model, n_heads)
self.ffn = FeedForward(d_model, d_model * 4)
self.ln1 = LayerNorm(d_model)
self.ln2 = LayerNorm(d_model)
def forward(self, x):
# Self-attention with residual + norm (Post-LN, the original Transformer ordering)
h = self.ln1(x + self.attention(x))
# Feed-forward with residual + norm
return self.ln2(h + self.ffn(h))
Build It
This code builds one station of the assembly line: it normalizes, runs attention, adds the residual shortcut, then does the same for the feed-forward step.
import numpy as np
def transformer_block(x, attn_fn, W1, W2, b1, b2):
# Pre-LN: normalize, then sublayer, then add
normed = rms_norm(x)
x = x + attn_fn(normed) # attention + residual
normed = rms_norm(x)
# FFN: expand to 4×d, activate, project back
h = np.maximum(0, normed @ W1 + b1) # (seq, 4*d)
x = x + h @ W2 + b2 # (seq, d) + residual
return x
# Param count per block:
# Attention: 4 × d² (Wq, Wk, Wv, Wo)
# FFN: 2 × d × 4d = 8d²
# Total per block: ~12d²
# LLaMA-7B: d=4096, N=32 → ~6.7B params
Under the Hood
The feed-forward network is where the model stores most of what it "knows" — facts, grammar rules, and patterns it picked up during training. The FFN expands to 4× d_model, applies an activation, and projects back. SwiGLU (used in LLaMA/Mistral) uses 3 matrices: (xW₁ · swish(xW₃)) @ W₂. Mixture of Experts (MoE) replaces one FFN with N expert FFNs + a router — more parameters without proportional compute (Mixtral has 46.7B params but only uses 12.9B per forward pass).
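The SwiGLU variant mentioned above can be sketched directly from the formula (xW₁ · swish(xW₃)) @ W₂. All shapes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def swish(x):                        # swish / SiLU activation
    return x / (1 + np.exp(-x))

def swiglu_ffn(x, W1, W3, W2):
    # gated FFN: one branch carries content, the other (swish) gates it
    return (x @ W1 * swish(x @ W3)) @ W2

d, d_ff = 8, 32                      # SwiGLU models often use d_ff ≈ (8/3)·d
W1 = rng.normal(size=(d, d_ff))
W3 = rng.normal(size=(d, d_ff))
W2 = rng.normal(size=(d_ff, d))

out = swiglu_ffn(rng.normal(size=(4, d)), W1, W3, W2)
print(out.shape)                     # (4, 8) — same shape in and out, like a standard FFN
```

The gate lets the network modulate information flow per dimension, which in practice trains slightly better than a plain ReLU FFN at equal parameter count.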
Key Takeaway
- A transformer is a stack of identical stations, each combining a “discussion round” (attention) with a “solo refinement step” (FFN), plus photocopied shortcuts (residuals) and tidying-up (normalization).
- The FFN is the filing cabinet — it holds the bulk of what the model has memorized, from facts to grammar.
- A clever trick called Mixture of Experts (MoE) lets a model have a huge filing cabinet but only open a few drawers at a time, keeping it fast.
How LLMs Work 🔗
Think of an incredibly well-read autocomplete — the kind that has read billions of web pages, books, and articles. A Large Language Model (LLM) does one thing: predict the next word. Your text goes in, gets chopped into tokens (Section 16), translated into number-lists called embeddings (Section 15), and then passed through the transformer assembly line (Section 18). At the end, the model looks at every word in its dictionary and assigns each one a probability — “how likely is this word to come next?” The winner (or a randomly chosen high-scorer) becomes the next word, and the whole process repeats, one word at a time.
class SimpleLM:
def __init__(self, vocab_size, d_model, n_layers, n_heads):
self.embed = EmbeddingTable(vocab_size, d_model)
self.blocks = [TransformerBlock(d_model, n_heads)
for _ in range(n_layers)]
self.ln_f = LayerNorm(d_model)
self.head = Linear(d_model, vocab_size)
def forward(self, token_ids):
x = self.embed(token_ids)
for block in self.blocks:
x = block(x)
x = self.ln_f(x)
logits = self.head(x) # (seq_len, vocab_size)
return logits
Build It
This code takes the transformer's final output for a sentence, scores every word in the dictionary, applies a “temperature” dial (higher = more creative, lower = more predictable), and picks the next word.
import numpy as np
# The LM head: project hidden state to vocabulary
d_model, vocab_size = 768, 50000
h_last = np.random.randn(d_model) # last hidden state
W_head = np.random.randn(vocab_size, d_model) # often tied to the embedding matrix
logits = W_head @ h_last # (50000,)
# Temperature scaling
T = 0.8
probs = np.exp((logits - logits.max()) / T)
probs /= probs.sum() # softmax with temperature
next_token = np.random.choice(vocab_size, p=probs)
Under the Hood
Running an LLM is like passing a message through every station on the assembly line, then checking every word in the dictionary at the end — the longer the message and the bigger the dictionary, the more work it takes. Full forward pass cost: embedding O(seq×d), attention O(N×seq²×d), FFN O(N×seq×d²), LM head O(vocab×d). Weight tying: the LM head often reuses the embedding matrix itself — the same table that encoded the input scores the vocabulary at the output — saving vocab×d parameters. During training, every position predicts the next token simultaneously, so one sequence yields seq_len training examples.
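Weight tying is easy to show with a toy stand-in for the transformer. Here the "model" is just a mean of embeddings — purely illustrative — but the tied head is the real pattern: the same table E both encodes the input and scores the vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 100, 16
E = rng.normal(size=(vocab, d)) * 0.1      # embedding table — the only weights here

def tiny_lm(token_ids):
    h = E[token_ids].mean(axis=0)          # toy stand-in for the transformer stack
    return E @ h                           # tied LM head: logits from the same table

tokens = [3, 17, 42]                       # a 3-token "prompt"
for _ in range(5):
    tokens.append(int(np.argmax(tiny_lm(tokens))))   # greedy next-token decoding
print(len(tokens))                         # 8 — the prompt plus 5 generated tokens
```

The loop is the whole generation process in miniature: score the vocabulary, pick a token, append it, repeat.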
Key Takeaway
- An LLM is a next-word predictor: it looks at everything written so far and guesses what comes next, like the world's most well-read autocomplete.
- The final step (called the “LM head”) converts the transformer's internal numbers into a probability for every word in the dictionary.
- A common memory-saving trick: the same table used to convert words into numbers at the start is reused in reverse at the end (called “weight tying”).
Training & Using LLMs
You know how the engine is built — now it is time to drive the car. This chapter shows how an AI learns from massive amounts of text, how it writes responses one word at a time, how you can guide it with clever instructions, and how you can give it access to outside knowledge and tools. By the end, you will build a tiny working language model from scratch.
~15 min · Training LLMs 🔗
Raising a child who learns to speak happens in stages — and training an AI language model follows the same pattern. First, the child listens to millions of conversations and picks up the patterns of language — this is called pre-training, where the model reads vast amounts of text and learns to predict the next word. Then the child learns manners and social rules — this is alignment, where techniques like RLHF (Reinforcement Learning from Human Feedback, meaning humans rate the AI's answers so it learns which responses are helpful and safe) and DPO (Direct Preference Optimization, a simpler way to teach preferences) fine-tune the model's behavior. Finally, the child might specialize for a particular job — this is fine-tuning, and a technique called LoRA (Low-Rank Adaptation) makes this practical by adjusting only a small fraction of the model's settings instead of rewriting everything from scratch.
# Simplified training stages
# Stage 1: Pre-training (next token prediction)
for batch in pretrain_dataloader:
logits = model(batch.input_ids)
loss = cross_entropy(logits, batch.target_ids)
loss.backward()
optimizer.step()
# Stage 2: Supervised Fine-Tuning (SFT)
for batch in instruction_dataloader:
logits = model(batch.prompt + batch.response)
loss = cross_entropy(logits, batch.response) # only on response
loss.backward()
optimizer.step()
# Stage 3: RLHF (simplified)
# Train reward model, then optimize policy with PPO
Build It
This code calculates how wrong the model's guess was and then nudges its settings in the right direction — the same show-observe-correct-repeat cycle from earlier chapters, applied to language.
import numpy as np
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()
# Cross-entropy gradient (elegant simplification)
logits = np.random.randn(50000)          # raw scores over the vocabulary
probs = softmax(logits)                  # model's predictions
target = 42                              # true next token ID
loss = -np.log(probs[target])            # cross-entropy loss
grad = probs.copy()
grad[target] -= 1                        # gradient = softmax - one_hot
# AdamW update (decoupled weight decay), shown for one parameter vector
beta1, beta2, lr, eps, weight_decay, t = 0.9, 0.999, 1e-3, 1e-8, 0.01, 1
w = np.random.randn(50000)               # illustrative parameters
m, v = np.zeros_like(w), np.zeros_like(w)
m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (squared gradients)
m_hat = m / (1 - beta1 ** t)             # bias correction
v_hat = v / (1 - beta2 ** t)
w = w * (1 - lr * weight_decay)          # decoupled decay
w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
Under the Hood
Here is what happens behind the scenes when the model learns from its mistakes. Cross-entropy gradient is simply softmax - one_hot. AdamW (Adam with Weight Decay, a popular optimizer) stores 2× the model parameters (m and v states): a 7B model needs ~56GB just for optimizer state. RLHF trains a reward model on human preferences, then uses PPO (Proximal Policy Optimization, a reinforcement learning algorithm) to optimize. DPO simplifies this by directly optimizing on preference pairs without a reward model. LoRA: freeze base weights, add W + A@B where A and B are small rank-r matrices.
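The LoRA formula W + A@B from above, in NumPy. The dimension d and rank r are illustrative; B starts at zero so the model is exactly unchanged at the start of fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8
W = rng.normal(size=(d, d))            # frozen base weight — never updated
A = rng.normal(size=(d, r)) * 0.01     # trainable, small random init
B = np.zeros((r, d))                   # trainable, zero init → A@B = 0 at start

x = rng.normal(size=d)
print(np.allclose(x @ (W + A @ B), x @ W))   # True: identical output before training

trainable = A.size + B.size            # 2 × d × r = 8,192
full = W.size                          # d² = 262,144
print(f"{trainable / full:.1%} of the full matrix")   # 3.1%
```

Training touches only A and B, so optimizer state (the 2× overhead of AdamW) also shrinks by the same factor — which is what makes laptop-scale fine-tuning possible.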
Key Takeaway
- An AI learns language the way a child does — by hearing billions of sentences and getting better at guessing the next word.
- After learning language, it learns manners — alignment (RLHF/DPO) teaches it to be helpful and safe, like a parent correcting behavior.
- You do not have to retrain the whole brain to teach it a new skill — LoRA lets you fine-tune just a small piece, making customization practical even on a laptop.
Inference & Decoding 🔗
An AI writes a story the same way you might — one word at a time, where each new word depends on everything written so far. For each word, the model looks at all the words before it and picks the next one. How it picks matters: always choosing the most obvious word (called greedy decoding) produces dull, predictable text. Adding a bit of randomness (called temperature) makes it more creative. Filters like top-k (only consider the k most likely words) and top-p (only consider words whose combined chances add up to p) keep it from going off the rails. There is also a crucial speed trick called the KV cache (Key-Value cache) — instead of re-reading the entire story from the beginning every time it writes a new word, the model remembers what it already processed, like using a bookmark instead of starting over from page one.
import numpy as np
def sample_with_temperature(logits, temperature=1.0):
    """Sample from logits with temperature scaling."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()   # subtract max for numerical stability
    probs = np.exp(scaled) / np.sum(np.exp(scaled))
    return np.random.choice(len(probs), p=probs)
def top_k_sampling(logits, k=10):
    """Keep only the top-k logits; mask the rest to -inf (probability 0)."""
    indices = np.argsort(logits)[-k:]
    mask = np.full_like(logits, -np.inf)
    mask[indices] = logits[indices]
    probs = np.exp(mask - mask.max()) / np.sum(np.exp(mask - mask.max()))
    return np.random.choice(len(probs), p=probs)
Build It
This code picks the next word by adjusting how adventurous the model is (temperature) and filtering out unlikely choices (top-k and top-p), then rolling the dice among the remaining options.
import numpy as np
def sample_token(logits, temperature=1.0, top_k=0, top_p=0.9):
logits = logits / temperature
if top_k > 0:
top_k_idx = np.argsort(logits)[-top_k:]
mask = np.full_like(logits, -np.inf)
mask[top_k_idx] = logits[top_k_idx]
logits = mask
probs = np.exp(logits - logits.max())
probs /= probs.sum()
    if top_p < 1.0:
        sorted_idx = np.argsort(probs)[::-1]
        cumsum = np.cumsum(probs[sorted_idx])
        # Drop tokens after the cumulative mass passes top_p, but always
        # keep the token that crosses the threshold (shift the mask by one)
        remove = np.roll(cumsum > top_p, 1)
        remove[0] = False
        probs[sorted_idx[remove]] = 0
        probs /= probs.sum() # renormalize!
return np.random.choice(len(probs), p=probs)
Under the Hood
The biggest performance trick in text generation is avoiding redundant work. KV cache: store K,V for each layer (memory: batch × n_layers × seq_len × d_model). For each new token, compute only its Q and attend to all cached K,V. This reduces per-token computation from O(seq²) to O(seq). Paged attention (vLLM) manages KV cache like virtual memory pages to reduce waste. Speculative decoding: a small draft model proposes N tokens, the large model verifies in parallel — up to Nx speedup with identical output.
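The KV-cache equivalence is easy to verify: attending with only the new token's query over cached keys and values gives exactly the same output as recomputing attention for the whole sequence. Dimensions are illustrative, and the Q/K/V projections are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def attend(Q, K, V):
    s = Q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

X = rng.normal(size=(6, d))                     # 6 tokens (projections pretended away)
full = attend(X[-1:], X, X)                     # recompute everything for the last token

K_cache, V_cache = X[:5].copy(), X[:5].copy()   # stored from the previous 5 steps
K_cache = np.vstack([K_cache, X[5:]])           # append only the new token's K...
V_cache = np.vstack([V_cache, X[5:]])           # ...and V
cached = attend(X[-1:], K_cache, V_cache)

print(np.allclose(full, cached))                # True — same output, O(seq) work per token
```

The cache trades memory for compute: nothing about the math changes, we simply stop recomputing K and V for tokens we have already seen.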
Key Takeaway
- The KV cache is like using a bookmark — instead of re-reading the whole book for every new word, the model remembers what it already processed, making generation dramatically faster.
- Temperature is a creativity dial: turn it up for surprising, imaginative text; turn it down for safe, predictable answers. Top-k and top-p act as quality filters that remove nonsensical choices.
- Speculative decoding is like having a fast assistant draft several words ahead and the expert just checks them — much faster, with identical results.
Context Windows & Prompting 🔗
Imagine reading a book through a small window that only shows a few pages at a time — that is an AI's context window, its short-term memory. The model can only "see" a limited amount of text at once, and everything outside that window is invisible to it. To keep track of word order, the model uses position encoding, a way of stamping each word with its place in the sentence (like page numbers in a book). Prompting — the art of writing good instructions for AI — works because the instructions, any examples you provide, and your actual question all get fed into this same window as regular text, and the attention mechanism (hearing your name at a party) treats them all equally.
# Prompting strategies
zero_shot = "Translate to French: Hello"
few_shot = """Translate to French:
Hello -> Bonjour
Goodbye -> Au revoir
Thank you -> Merci
Good morning ->"""
chain_of_thought = """Q: If a store has 5 apples and
sells 2, how many remain?
Let's think step by step:
1. Start with 5 apples
2. Sell 2 apples
3. 5 - 2 = 3 apples remain
A: 3"""
Build It
This code creates the "page numbers" that tell the model where each word sits in the sentence, using a wave pattern that gives every position a unique fingerprint.
import numpy as np
# Sinusoidal positional encoding (original Transformer)
def positional_encoding(seq_len, d_model):
pos = np.arange(seq_len)[:, np.newaxis]
dim = np.arange(d_model)[np.newaxis, :]
angle = pos / 10000 ** (2 * (dim // 2) / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angle[:, 0::2])
pe[:, 1::2] = np.cos(angle[:, 1::2])
return pe
# Usage: add to embeddings
# X = token_embeddings + positional_encoding(seq_len, d_model)
Under the Hood
The way models keep track of word order has improved significantly over time. Position encoding evolution: sinusoidal (fixed) → learned absolute (GPT-2) → RoPE (Rotary Position Embedding, used in modern models like LLaMA). RoPE rotates the Query and Key vectors by position-dependent angles, making attention scores depend on relative position (how far apart two words are, not their absolute positions). This enables context extension via NTK-aware scaling (a mathematical trick to stretch the model's window to longer texts than it was trained on). Prompting is not a special mechanism — few-shot examples work because attention sees the pattern and continues it.
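RoPE's key property — attention scores that depend only on relative distance — can be verified numerically. Below is a minimal rotate-half sketch (one common layout; real implementations differ in how dimensions are paired):

```python
import numpy as np

def rope(x, pos, base=10000):
    """Rotate pairs of dimensions by position-dependent angles (rotate-half layout)."""
    half = len(x) // 2
    theta = pos * base ** (-np.arange(half) / half)   # one angle per 2-D plane
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

s1 = rope(q, 5) @ rope(k, 3)        # positions 5 and 3 — a gap of 2
s2 = rope(q, 105) @ rope(k, 103)    # positions 105 and 103 — the same gap
print(np.allclose(s1, s2))          # True: only the gap matters, not absolute position
```

Because each 2-D rotation by angle mθ composed with one by nθ depends only on (m − n)θ, the attention score between two words is a function of how far apart they are — exactly the property that makes context-window extension tricks possible.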
Key Takeaway
- The context window is the AI's short-term memory — it can only see a fixed amount of text at once, like reading through a window. Modern position-tracking methods like RoPE (Rotary Position Embedding) let the model understand how far apart words are, and can be stretched to widen that window.
- Prompting is not magic — your instructions, examples, and questions are all just words in the window, and the model pays attention to all of them equally when crafting its response.
RAG, Tool Use & Agents 🔗
Picture an expert with amnesia — brilliant, but they cannot remember anything beyond what is in front of them right now. That is an AI without help: its training data is frozen in the past, and its short-term memory (context window) is limited. RAG (Retrieval-Augmented Generation, which means "fetch relevant reference pages before answering") fixes this by looking up the most useful documents and slipping them into the prompt so the model can read them. Tool use goes further — it lets the model call outside services like a calculator, a search engine, or a database, the way you might pick up a phone to look something up. Agents combine all of this into a loop: the model thinks about what to do, takes an action, observes the result, and repeats until the task is done.
# Simplified RAG pipeline
def rag_answer(query, documents, model):
# 1. Embed the query
query_emb = model.embed(query)
# 2. Retrieve top-k relevant documents
scores = [cosine_similarity(query_emb, doc.embedding)
for doc in documents]
top_docs = sorted(zip(scores, documents), key=lambda sd: sd[0], reverse=True)[:3]
# 3. Augment the prompt with retrieved context
context = "\n".join(doc.text for _, doc in top_docs)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
# 4. Generate answer
return model.generate(prompt)
Build It
This code shows two patterns: RAG finds the most relevant documents and pastes them into the prompt before asking the model, and an agent loop that keeps thinking and acting until the task is finished.
import numpy as np
# RAG: retrieve relevant context
def rag(query, documents, embeddings, top_k=3):
q_emb = embed(query)
scores = embeddings @ q_emb # dot product (cosine similarity if rows are unit-normalized)
top_idx = np.argsort(scores)[-top_k:]
context = "\n".join(documents[i] for i in top_idx)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
return llm(prompt)
# Agent loop
def agent(task):
history = [{"role": "user", "content": task}]
while True:
response = llm(history)
if response.tool_call:
result = execute_tool(response.tool_call)
history.append({"role": "tool", "content": result})
else:
return response.text
Under the Hood
The hardest part of RAG is deciding how to break documents into pieces the model can digest. RAG chunking strategies: fixed-size (simple), semantic (split at paragraph boundaries), recursive (split large chunks further). Embedding models (like BERT — Bidirectional Encoder Representations from Transformers) produce vectors for similarity search. For scale, use approximate nearest neighbor search (HNSW — Hierarchical Navigable Small World, a fast lookup structure) — exact search is O(n). Agent challenges: errors accumulate over long trajectories, and the model must plan with imperfect information.
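The simplest of the chunking strategies above — fixed-size with overlap — is only a few lines. The sizes here are arbitrary; real systems tune them to the embedding model's input limit:

```python
def chunk_fixed(text, size=200, overlap=40):
    """Fixed-size chunking with overlap, so a sentence cut at one
    chunk boundary still appears whole in the neighboring chunk."""
    chunks, start = [], 0
    step = size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks

doc = "word " * 200          # a 1,000-character toy document
chunks = chunk_fixed(doc)
print(len(chunks))           # 7 chunks of ≤ 200 chars, each overlapping the next by 40
```

The overlap wastes a little index space but prevents the classic failure where the one sentence that answers the question is split across two chunks and retrieved by neither.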
Key Takeaway
- RAG is like giving the amnesia expert a few relevant reference pages before they answer your question — it fetches the right information so the AI does not have to rely on memory alone.
- Tool use gives the AI hands — it can search the web, run calculations, or query a database, just like you would reach for a calculator or phone.
- An agent is an AI that works independently: it thinks about what to do, does it, checks the result, and repeats — like a diligent assistant who keeps going until the job is done.
Evaluation & Practical Considerations 🔗
Imagine giving someone a multiple-choice test where every question has a different number of options. Perplexity (a measure of how "surprised" the model is) works just like that: a perplexity of 10 means the model is choosing among roughly 10 equally likely options for each word. Lower is better — a well-trained model is rarely surprised. To make models cheaper to run, there is a trick called quantization (rounding detailed measurements to whole numbers) — think of it like rounding 3.14159 to just 3. The answer is close enough for practical use, but the math is much faster. A 7-billion-parameter model that normally needs 14 gigabytes of memory can be squeezed into about 3.5 gigabytes using 4-bit quantization, with barely any drop in quality.
import numpy as np
def perplexity(log_probs):
    """Compute perplexity from log probabilities."""
    avg_log_prob = np.mean(log_probs)
    return np.exp(-avg_log_prob)
def accuracy(predictions, labels):
    """Simple classification accuracy."""
    return np.mean(np.array(predictions) == np.array(labels))
# Practical model selection criteria:
# - Task: classification, generation, reasoning?
# - Latency: real-time vs. batch?
# - Cost: tokens per dollar?
# - Privacy: can data leave your infrastructure?
Build It
This code measures how confused the model is (perplexity) and how accurately it classifies things (precision, recall, and F1 score) — the basic report card for any AI.
import numpy as np
# Perplexity: how surprised is the model?
log_probs = np.array([-2.3, -1.1, -0.5, -3.2, -1.8])
perplexity = np.exp(-np.mean(log_probs))
# perplexity ≈ 6.0 → choosing from ~6 equally likely options
# Precision, Recall, F1
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    if precision + recall == 0:
        return 0
    return 2 * precision * recall / (precision + recall)
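As a quick sanity check on the F1 arithmetic, here is the same formula run on hypothetical counts (the 8/2/4 numbers are made up for illustration):

```python
tp, fp, fn = 8, 2, 4                  # hypothetical confusion-matrix counts
precision = tp / (tp + fp)            # 8/10 = 0.80: of flagged items, how many were right
recall = tp / (tp + fn)               # 8/12 ≈ 0.67: of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.80 recall=0.67 f1=0.73
```

F1 sits between precision and recall but punishes imbalance: a model with 0.99 precision and 0.01 recall scores near zero, not near 0.5.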
Under the Hood
Rounding works so well because most of the model's internal numbers cluster near zero, so you lose very little by storing them with fewer bits. More precisely, weight distributions are approximately Gaussian, which is why a handful of bits per weight suffice. INT4 (4-bit integers) with outlier handling, via popular methods like GPTQ and AWQ, typically loses less than 1% accuracy. Memory math: 7 billion parameters × 2 bytes (fp16) = 14GB; 7B × 0.5 bytes (4-bit) ≈ 3.5GB + overhead. Common benchmarks: MMLU (Massive Multitask Language Understanding — tests knowledge), HumanEval (tests coding ability), GPQA (tests graduate-level reasoning) — but benchmarks can be gamed, so always evaluate on YOUR task.
Key Takeaway
- Perplexity is the AI's confusion score — like counting how many equally likely choices it is torn between for each word. Lower means the model understands the text better.
- Quantization is like rounding detailed measurements to whole numbers — 4-bit quantization shrinks a model to one quarter of its original memory with barely any drop in quality, making it practical to run on everyday hardware.
- Standardized tests can be gamed — always test the model on your own real-world task, because a high benchmark score does not guarantee it works well for your specific need.
Capstone — Build a Tiny LM 🔗
After learning what every part of an engine does, it is time to build a small working engine and watch it run. This final section brings together every concept you have learned into one complete, tiny language model that trains and generates text right in your browser. It uses word-to-number mappings (embeddings, Section 15), word-order stamps (positional encoding, Section 22), the "hearing your name at a party" mechanism (attention, Section 17), the factory processing steps with skip connections (FFN with residuals, Section 18), the scoring system (cross-entropy loss, Section 8), and the blindfolded hill-walking optimizer (gradient descent, Section 5). The model has only about 10,000-15,000 adjustable settings and can learn from Shakespeare in minutes.
import numpy as np
# TransformerBlock, LayerNorm, and sample_with_temperature are assumed
# from earlier sections
class TinyLM:
    """A complete tiny language model."""
    def __init__(self, vocab_size=256, d_model=64,
                 n_heads=4, n_layers=2, max_len=128):
        self.tok_emb = np.random.randn(vocab_size, d_model) * 0.02
        self.pos_emb = np.random.randn(max_len, d_model) * 0.02
        self.blocks = [TransformerBlock(d_model, n_heads)
                       for _ in range(n_layers)]
        self.ln_f = LayerNorm(d_model)
        self.head = np.random.randn(d_model, vocab_size) * 0.02
    def forward(self, token_ids):
        T = len(token_ids)
        x = self.tok_emb[token_ids] + self.pos_emb[:T]
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = x @ self.head
        return logits
    def generate(self, prompt_ids, max_new=50, temp=0.8):
        ids = list(prompt_ids)
        for _ in range(max_new):
            logits = self.forward(np.array(ids))[-1]
            next_id = sample_with_temperature(logits, temp)
            ids.append(next_id)
        return ids
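The generate method relies on a sample_with_temperature helper. A minimal sketch of what such a helper typically does (scale the logits by temperature, softmax, then sample); the exact implementation here is an assumption, not the original:

```python
import numpy as np

def sample_with_temperature(logits, temp=0.8):
    """Divide logits by temperature, softmax, then sample one token id.
    Lower temp sharpens the distribution; higher temp flattens it."""
    z = logits / temp
    z = z - z.max()                       # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return np.random.choice(len(probs), p=probs)

# Usage: at low temperature the top logit wins almost every time
np.random.seed(0)
logits = np.array([2.0, 1.0, 0.1])
picks = [sample_with_temperature(logits, temp=0.1) for _ in range(100)]
print(max(set(picks), key=picks.count))   # 0
```

At temp → 0 this approaches greedy decoding (always the argmax); at temp → ∞ it approaches uniform random tokens.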
Build It
This code defines the complete tiny language model — it sets up all the parts (embeddings, attention weights, feed-forward layers) and wires them together so text goes in one end and predictions come out the other.
import numpy as np
class TinyLM:
    def __init__(self, vocab_size=40, d_model=32, n_heads=2, d_ff=64, ctx_len=32):
        s = 0.02
        self.embed = np.random.randn(vocab_size, d_model) * s  # token embedding
        self.pos = np.random.randn(ctx_len, d_model) * s       # position embedding
        self.Wq = np.random.randn(d_model, d_model) * s        # attention
        self.Wk = np.random.randn(d_model, d_model) * s
        self.Wv = np.random.randn(d_model, d_model) * s
        self.Wo = np.random.randn(d_model, d_model) * s
        self.W1 = np.random.randn(d_model, d_ff) * s           # FFN
        self.W2 = np.random.randn(d_ff, d_model) * s
        # LM head = embed.T (weight tying)
    def forward(self, token_ids):
        x = self.embed[token_ids] + self.pos[:len(token_ids)]
        # ... attention, FFN, residuals (see visualization)
        logits = x @ self.embed.T  # weight-tied LM head
        return logits
Under the Hood
This tiny model is a miniature version of the same design used by the largest AI systems in the world. This model implements every concept: embedding lookup (Section 15), positional encoding (Section 22), RMSNorm (Section 13), multi-head causal attention with √d_k scaling (Section 17), residual connections (Section 13), FFN with GELU (Section 18), and weight-tied LM head (Section 19). Character-level tokenization avoids needing BPE. Total params: ~10K-15K — small enough to train in your browser.
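The RMSNorm mentioned above fits in a few lines. This minimal version omits the learned per-dimension gain that full implementations multiply in at the end:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale each row to unit root-mean-square. Unlike LayerNorm,
    RMSNorm skips the mean-subtraction step (no centering)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[3.0, 4.0]])
print(rms_norm(x))   # each row now has RMS ≈ 1
```

Dropping the centering step saves a pass over the activations while normalizing their magnitude just as well, which is why many recent models prefer RMSNorm over classic LayerNorm.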
Key Takeaway
- Everything you learned — from tiny voting machines (neurons) to the attention party trick — comes together in this one working model, like assembling an engine from parts you already understand.
- At its core, a language model does four things: turn words into numbers, figure out which words matter to each other, process that information, and predict the next word.
- The exact same design works whether the model has 10,000 settings or over a trillion — bigger just means it can learn more patterns and give better answers.