Foundations
No math degree required. We start with everyday analogies—recipes, spreadsheets, test scores, and blindfolded hill walking—to build your intuition for how machines learn from examples. By the end of this chapter you will understand the core loop behind every AI model.
What is AI/ML? 🔗
Imagine learning to cook from a recipe versus learning to cook by tasting hundreds of dishes and figuring out what works. Traditional programming is the recipe approach — a programmer writes out every rule by hand: "if this, then that." Machine Learning (ML) flips this — instead of writing rules, you show the computer lots of examples of inputs and correct answers, and it figures out the rules on its own. The thing the computer builds from those examples is called a model, and it is really just a formula with some adjustable knobs (called parameters) that turn inputs into outputs.
Build It
This code gives the computer five example data points and lets it discover the pattern (a straight line) on its own — no rules written by hand.
import numpy as np
# The entire idea of ML in 8 lines
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2.1, 4.0, 5.8, 8.1, 9.9])
# Closed-form solution (OLS)
X_b = np.c_[X, np.ones(len(X))] # add bias column
w = np.linalg.lstsq(X_b, y, rcond=None)[0]
print(f"y = {w[0]:.2f}x + {w[1]:.2f}") # y ≈ 2.0x + 0.1
Under the Hood
Even though the model learned on its own, the result is just two numbers — a slope and a starting point — so the entire "brain" fits in 16 bytes.
The 'model' here is two float64 numbers: a slope and an intercept — 16 bytes total. np.linalg.lstsq solves the least-squares problem with an SVD-based LAPACK routine (O(nd²)), not explicit matrix inversion (which is numerically unstable). This closed-form solution works for linear regression but doesn't scale to complex models — that's why we'll need gradient descent.
Key Takeaway
- Traditional programming is like following a recipe; ML is like learning to cook by tasting — the computer finds the rules from examples instead of being told them
- A model is just a formula with adjustable knobs — turn the knobs until the answers come out right
- Even drawing a single best-fit line through five points captures the entire idea of machine learning
Data & Features 🔗
Think of a spreadsheet where every row is one house you are looking at, and every column is a measurement — square footage, number of bedrooms, distance to the nearest school. That is exactly how a computer sees data: rows are examples, columns are measurements (called features). Raw information like text or photos first has to be converted into numbers so the computer can work with it. One last trick: if square footage ranges up to 3,000 but bedrooms only go up to 5, the big numbers would push the small ones around. So we rescale everything to a similar range — a step called feature scaling (or standardization).
Build It
This code takes a tiny spreadsheet of houses (square footage and bedrooms) and rescales the columns so no single measurement hogs the spotlight.
import numpy as np
# Feature scaling: standardization
raw = np.array([[1500, 3], [2000, 4], [1200, 2]]) # [sq_ft, bedrooms]
means = raw.mean(axis=0) # per-feature mean
stds = raw.std(axis=0) # per-feature std
X = (raw - means) / stds # scaled: mean=0, std=1
# Shape: (n_samples, n_features) = (3, 2)
print(f"Shape: {X.shape}, Means: {X.mean(axis=0)}")
Under the Hood
Without rescaling, the learning process swerves wildly because one column's numbers are a thousand times bigger than another's — like trying to steer a car where the gas pedal is a thousand times more sensitive than the brake.
The feature matrix is a contiguous block of memory — (n_samples × n_features) × 8 bytes for float64. Standardization is O(n) per feature. Without scaling, gradient descent zigzags: a feature ranging 0-1000 creates gradients 1000× larger than a feature ranging 0-1, causing inefficient optimization.
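The "1000× larger gradients" claim above can be checked directly. This is a minimal sketch with made-up data: one feature spanning roughly 0-1000, one spanning 0-1, and the MSE gradient of a linear model evaluated at zero weights.

```python
import numpy as np

# Sketch: compare per-feature gradient magnitudes with and without scaling.
rng = np.random.default_rng(0)
raw = np.column_stack([rng.uniform(0, 1000, 50),   # feature 0: big range
                       rng.uniform(0, 1, 50)])     # feature 1: small range
y = 0.003 * raw[:, 0] + 2.0 * raw[:, 1]

def mse_grad(X, y, w):
    """Gradient of MSE with respect to the weights of a linear model."""
    return 2 * X.T @ (X @ w - y) / len(y)

g_raw = mse_grad(raw, y, np.zeros(2))
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
g_scaled = mse_grad(scaled, y - y.mean(), np.zeros(2))

print(abs(g_raw[0] / g_raw[1]))        # unscaled: the big feature dominates
print(abs(g_scaled[0] / g_scaled[1]))  # scaled: comparable magnitudes
```

On the unscaled data the first feature's gradient is hundreds of times larger, so a single learning rate cannot suit both weights; after standardization the two gradients are the same order of magnitude.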
Key Takeaway
- The computer sees all data as a spreadsheet of numbers — rows are examples, columns are measurements
- Rescaling (feature scaling) puts every column on an equal playing field, like converting miles and kilometers to the same unit
- Skip this step and the learning process wobbles instead of heading straight for the answer
Linear Regression 🔗
Picture a scatter of dots on a graph — say, house sizes along the bottom and prices up the side. Now lay a ruler across those dots so it passes as close to all of them as possible. That ruler is your first model, and the technique is called linear regression. It predicts the output by multiplying each input by a weight (how much that input matters), adding them up, and tossing in one extra number called a bias (where the line crosses the zero mark). Simple as it is, it introduces the three ideas behind every AI model: adjustable numbers (parameters), making a guess (prediction), and a neat bookkeeping shortcut called the bias trick.
import numpy as np
# Linear regression prediction
def predict(x, w, b):
    return w * x + b
# Example
x = np.array([1, 2, 3, 4, 5])
w, b = 2.0, 1.0
y_hat = predict(x, w, b) # [3, 5, 7, 9, 11]
Build It
This code finds the single best-fit straight line through five data points — the computer figures out the slope and starting point all by itself.
import numpy as np
# The normal equation: w = (X^T X)^{-1} X^T y
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.0, 5.8, 8.1, 9.9])
# Bias trick: append column of 1s
X_b = np.c_[X.reshape(-1, 1), np.ones(len(X))]
# Solve (use lstsq, NOT inv — numerically stable)
w = np.linalg.lstsq(X_b, y, rcond=None)[0]
print(f"y = {w[0]:.2f}x + {w[1]:.2f}")
Under the Hood
There is a shortcut formula that finds the perfect line in one shot, but it gets painfully slow when you have lots of measurements — which is why later we will learn a step-by-step approach instead.
Matrix inversion is O(d³) where d = number of features. The bias trick folds b into the weight vector by appending a column of 1s to X, so [w, b] @ [x, 1] = w*x + b. Warning: when regularizing later (Section 12), do NOT penalize the bias column.
Key Takeaway
- Linear regression is a ruler laid across your data — the simplest possible model, capturing the idea of "multiply, add up, predict"
- The bias trick is a bookkeeping shortcut that bundles the line's starting point into the same math as the slope
- The one-shot formula works great for small problems but chokes on big ones — motivating the step-by-step method coming next
Loss Functions — MSE 🔗
Think of a test score, but flipped: zero means perfect and the bigger the number, the worse you did. That is exactly what a loss function does — it looks at the model's guesses and the correct answers, and produces a single "wrongness score." The most common version is called Mean Squared Error (MSE). It works by checking how far off each guess was, squaring those gaps (so a big miss counts way more than a small one), and averaging them all together. Lower is always better.
import numpy as np
def mse_loss(y_true, y_pred):
    """Mean Squared Error loss."""
    return np.mean((y_true - y_pred) ** 2)
# Example
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.8, 5.1, 7.3])
loss = mse_loss(y_true, y_pred) # 0.0467
Build It
This code calculates how wrong a set of predictions is (the loss) and which direction each guess should move to get closer to the right answer (the gradient).
import numpy as np
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.2, 6.5, 9.1])
# MSE: mean of squared residuals
mse = np.mean((y_pred - y_true) ** 2)
# Gradient: points toward improvement
d_mse = 2 * (y_pred - y_true) / len(y_true)
print(f"MSE: {mse:.4f}")
print(f"Gradient: {d_mse}")
Under the Hood
Squaring the errors is not just for show — it makes big misses hurt a lot more than small ones, which pushes the model to fix its worst guesses first.
The loss collapses all prediction errors into a single scalar. The gradient tells us which direction to adjust each prediction. Squaring serves two purposes: it's differentiable everywhere (unlike absolute value), and it penalizes large errors more than small ones. Cross-entropy (introduced in Section 8) is used instead when outputs are probabilities.
Key Takeaway
- A loss function is a "wrongness score" — like a test score where zero is perfect and bigger is worse
- MSE (Mean Squared Error) averages the squared gaps between guesses and answers — big misses get penalized heavily
- The gradient is a signpost that tells the model which way to adjust to shrink that wrongness score
Gradient Descent 🔗
Imagine you are blindfolded in the middle of a hilly field and you need to find the lowest valley. You cannot see, but you can feel the ground sloping under your feet. Each step, you figure out which direction goes downhill and take a small step that way. This process of feeling your way downhill is called gradient descent, and it is how nearly every AI model learns. The size of each step is controlled by a setting called the learning rate. Take steps that are too big and you leap right over the valley; too small and you inch along forever.
import numpy as np

def gradient_descent_step(w, b, x, y, lr=0.01):
    """One step of gradient descent for linear regression."""
    n = len(x)
    y_pred = w * x + b
    dw = (-2 / n) * np.sum(x * (y - y_pred))
    db = (-2 / n) * np.sum(y - y_pred)
    w -= lr * dw
    b -= lr * db
    return w, b
Build It
This code starts with a random guess and repeatedly nudges it downhill until it lands on the answer (3.0) — the blindfolded-hill-walking idea in five lines.
import numpy as np
# Gradient descent in 5 lines
w = np.random.randn() # random start
lr = 0.01 # learning rate
for step in range(100):
    grad = 2 * (w - 3.0)  # gradient of L = (w-3)^2
    w = w - lr * grad     # THE update rule
    print(f"Step {step}: w={w:.4f}, loss={(w-3)**2:.6f}")
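The too-big/too-small step-size tradeoff from the opening paragraph can be seen on this same toy loss. A sketch, with illustrative learning rates: since the update multiplies the error (w - 3) by (1 - 2*lr) each step, any lr above 1.0 makes the error grow instead of shrink.

```python
import numpy as np

def run(lr, steps=50, w0=0.0):
    """Run gradient descent on L = (w - 3)^2 from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * (w - 3.0)  # gradient of (w-3)^2 is 2(w-3)
    return w

w_small = run(lr=0.001)  # inches along: still far from 3 after 50 steps
w_good = run(lr=0.1)     # converges to ~3.0
w_big = run(lr=1.1)      # overshoots worse each step: diverges

print(w_small, w_good, w_big)
```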
Under the Hood
The actual "step downhill" is trivially cheap; almost all the work goes into figuring out which direction is downhill in the first place.
The weight update w -= lr * gradient is O(1) per parameter. ALL the computational cost is in computing the gradient. SGD (Stochastic Gradient Descent) estimates the gradient from a random mini-batch instead of the full dataset — noisier but much faster per step. Adam adds momentum and adaptive learning rates per parameter.
Key Takeaway
- Gradient descent is the blindfolded hill walk: feel which way slopes down, take a small step, repeat
- The learning rate is your step size — the single most important setting to get right
- Fancier versions — SGD (Stochastic Gradient Descent, which samples a small batch instead of all the data) and Adam (which builds up momentum like a rolling ball) — take shortcuts or build up speed, but the core idea never changes: step downhill
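The two variants named in the last bullet can be sketched in a few lines each. This is a toy setup (made-up data, illustrative hyperparameters), not a production optimizer: SGD estimates the gradient from a random mini-batch, and Adam maintains a running mean of gradients (momentum) plus a running mean of squared gradients (per-parameter scale).

```python
import numpy as np

# SGD: gradient estimated from a random mini-batch of 32 points
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 3.0 * X  # true weight is 3.0
w = 0.0
for _ in range(200):
    idx = rng.integers(0, len(X), size=32)    # random mini-batch
    xb, yb = X[idx], y[idx]
    grad = 2 * np.mean(xb * (w * xb - yb))    # gradient on the batch only
    w -= 0.05 * grad

# Adam: the update rule, on the toy loss L = (w - 3)^2
wa, m, v = 0.0, 0.0, 0.0
beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 201):
    g = 2 * (wa - 3.0)                     # gradient of (w-3)^2
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients
    v = beta2 * v + (1 - beta2) * g**2     # running mean of squared gradients
    m_hat = m / (1 - beta1**t)             # bias correction
    v_hat = v / (1 - beta2**t)
    wa -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w, wa)  # both approach the true value 3.0
```

Note how the core of both loops is still the same `w -= lr * gradient` step; only the gradient estimate (SGD) or the step scaling (Adam) changes.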
The Training Loop 🔗
Teaching a dog a new trick follows a predictable pattern: you show the trick, watch what the dog does wrong, figure out how to correct it, and adjust your approach — then repeat. Every AI model learns the same way, in a four-step cycle: (1) make a guess (the forward pass), (2) check how wrong the guess was (the loss), (3) figure out which knobs to turn and how far (the backward pass — computing gradients), and (4) actually turn those knobs (the update). One full trip through all the training examples is called an epoch, and the model gets a little better with each lap.
for epoch in range(num_epochs):
    for x_batch, y_batch in dataloader:
        # Forward pass
        y_pred = model(x_batch)
        loss = loss_fn(y_pred, y_batch)
        # Backward pass
        loss.backward()
        # Update parameters
        optimizer.step()
        optimizer.zero_grad()
Build It
This code runs the full four-step loop (guess, score, figure out correction, adjust) fifty times on 100 data points, and you can watch the loss drop as the model learns.
import numpy as np
X = np.random.randn(100, 1) # 100 samples, 1 feature
y = 2.5 * X + 1.0 + np.random.randn(100, 1) * 0.3
w, b = np.random.randn(), 0.0
lr = 0.01
for epoch in range(50):
    # 1. Forward
    y_pred = w * X + b
    # 2. Loss
    loss = np.mean((y_pred - y) ** 2)
    # 3. Backward (gradients)
    dw = 2 * np.mean(X * (y_pred - y))
    db = 2 * np.mean(y_pred - y)
    # 4. Update
    w -= lr * dw
    b -= lr * db
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: loss={loss:.4f}, w={w:.3f}, b={b:.3f}")
Under the Hood
The four-step loop looks almost identical whether you are training a tiny line-fitting model or a billion-parameter language model — only the size of the math changes.
Those four gradient lines (dw = X.T @ d_pred and friends) are backprop for a linear model. In mini-batch training, shapes go from (features,) to (batch_size, features) — every matmul is batched. Shape mismatches in the batch dimension are the #1 source of bugs in ML code.
Key Takeaway
- Every AI model trains the same way, like teaching a dog: show the trick, see what went wrong, figure out the correction, adjust — repeat
- This four-step loop (guess, score, figure out correction, adjust) is the heartbeat of all deep learning
- Processing examples in small batches (mini-batching) lets the computer chew through data much faster by working on many examples at once
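The mini-batching mentioned in the last bullet can be grafted straight onto the four-step loop. A sketch with an illustrative batch size of 16: each epoch shuffles the data, then runs the guess-score-correct-adjust cycle once per small batch instead of once per full pass.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.5 * X + 1.0 + rng.normal(size=(100, 1)) * 0.3
w, b, lr, batch_size = 0.0, 0.0, 0.05, 16

for epoch in range(100):
    perm = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]  # one mini-batch
        xb, yb = X[idx], y[idx]
        y_pred = w * xb + b                   # 1. forward
        dw = 2 * np.mean(xb * (y_pred - yb))  # 3. backward (on this batch)
        db = 2 * np.mean(y_pred - yb)
        w -= lr * dw                          # 4. update
        b -= lr * db

print(f"w={w:.2f}, b={b:.2f}")  # close to the true 2.5 and 1.0
```

Each update now touches 16 examples instead of 100, so the model takes many cheap noisy steps per epoch rather than one expensive exact one.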
Single Neuron 🔗
Picture a tiny voting machine: several people each cast a vote with different levels of enthusiasm (some shout, some whisper), the machine adds up all those weighted votes, and then makes a yes-or-no decision based on the total. A single neuron works the same way — it takes several input numbers, multiplies each one by a weight (how much that input matters), adds them all up with a nudge value (the bias), and then passes the total through a special gate called an activation function. That gate is the twist: without it, a neuron could only draw straight lines. With it, it can learn curves. The order matters — you add up first, then pass through the gate, not the other way around.
import numpy as np
def neuron(x, w, b):
    """Single neuron with sigmoid activation."""
    z = np.dot(w, x) + b
    return 1 / (1 + np.exp(-z))
x = np.array([0.5, 0.3, 0.2])
w = np.array([0.4, -0.1, 0.8])
b = 0.1
output = neuron(x, w, b) # ~0.60
Build It
This code builds one tiny voting machine: it multiplies three inputs by three weights, adds them up, and squishes the result through a gate that outputs a number between 0 and 1.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# A single neuron
x = np.array([0.5, -0.3, 1.2]) # 3 inputs
w = np.array([0.8, -0.5, 1.0]) # 3 weights
b = 0.1 # bias
z = np.dot(w, x) + b # weighted sum: 1.85
a = sigmoid(z) # activation: 0.86
# Sigmoid derivative (computed from output alone!)
da_dz = a * (1 - a) # 0.117
print(f"z={z:.2f}, a={a:.2f}, da/dz={da_dz:.3f}")
Under the Hood
A single neuron does surprisingly little math — one multiplication-and-add, then one squeeze through a gate — but stacking thousands of them creates the power behind modern AI.
A neuron computes a dot product O(d) plus one nonlinear function O(1). The sigmoid derivative a*(1-a) can be computed from the output alone — no need to store z. Note: the biological neuron analogy is extremely loose — real neurons use spike timing, not continuous values.
Key Takeaway
- A neuron is a tiny voting machine — it weighs its inputs, adds them up, and passes the total through a gate to make a decision
- The gate (activation function) is the secret ingredient that lets the model learn curves instead of only straight lines
- The sigmoid gate squishes any number, no matter how huge or negative, into a tidy range between 0 and 1 — handy for yes/no answers
Activation Functions + Softmax + Cross-Entropy 🔗
Imagine a series of gates in a water pipe, each deciding how much water to let through. That is what activation functions do inside a neural network — they control which signals pass and how strongly. The original gate, the sigmoid from the last section, has a problem: it lets less and less water through the deeper the pipe goes, until the flow is practically zero. This fading signal is called the vanishing gradient problem. A newer gate called ReLU (Rectified Linear Unit) fixes this with a dead-simple rule: let positive signals through untouched, block everything negative. Modern AI models use even smoother versions called GELU (Gaussian Error Linear Unit) and SwiGLU (Swish-Gated Linear Unit) — think of them as smarter valves that let a tiny trickle through even for slightly negative values. There is one more tool in this section: softmax, which takes a list of raw scores and converts them into percentages that add up to 100%, and cross-entropy, a loss function that measures how far off those percentages are from reality.
import numpy as np
def relu(z):
    return np.maximum(0, z)

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # numerical stability
    return exp_z / exp_z.sum()

def cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred + 1e-9))
Build It
This code defines several valve types (ReLU, sigmoid, GELU), converts raw scores into percentages with softmax, and measures how wrong those percentages are with cross-entropy loss.
import numpy as np
# Activation functions as one-liners
relu = lambda x: np.maximum(0, x)
sigmoid = lambda x: 1 / (1 + np.exp(-x))
gelu = lambda x: x * 0.5 * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))
# Softmax with numerical stability
def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max to prevent overflow
    return e / e.sum()
# Cross-entropy loss
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits) # [0.659, 0.242, 0.099]
target = 0 # true class index
loss = -np.log(probs[target]) # 0.417
print(f"Probs: {probs}, Loss: {loss:.3f}")
Under the Hood
The sigmoid valve chokes the learning signal by 75% at every layer — stack ten layers and you have practically zero signal left, which is why deeper networks needed a better valve.
Vanishing gradient: sigmoid's max derivative is 0.25. After 10 layers: 0.25^10 = 9.5×10⁻⁷ — gradients effectively disappear. ReLU's derivative is exactly 1 for positive inputs, so gradients flow undiminished. GELU (x·Φ(x)) is smoother than ReLU and used in GPT/BERT. Modern LLMs (LLaMA, Mistral) use SwiGLU: (xW₁ · swish(xW₃)) @ W₂ — three weight matrices instead of two.
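The 0.25^10 arithmetic above is easy to verify. A small sketch: the gradient reaching layer 1 of a deep network is the product of one derivative factor per layer, and for sigmoid each factor is at most 0.25.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(size=10)                      # one preactivation per layer
sig_factors = sigmoid(z) * (1 - sigmoid(z))  # each factor is at most 0.25

print(np.prod(sig_factors))  # the actual product: even tinier than the bound
print(0.25 ** 10)            # best-case bound: ~9.5e-07
```

Since ReLU's derivative is exactly 1 for positive inputs, the corresponding product for a ReLU path through active neurons stays at 1, which is the whole fix.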
Key Takeaway
- ReLU is a one-way valve that lets positive signals through at full strength, solving the fading-signal problem that plagued earlier valves like sigmoid
- Softmax turns a list of raw scores into percentages that add up to 100% — so the model can say "I am 80% sure it is a cat"
- Cross-entropy is a "wrongness score" for those percentages — the further from reality, the higher the penalty
- Today's most powerful AI models (like GPT and LLaMA) use smoother, more advanced valves called GELU and SwiGLU
Neural Networks
You have seen how a single tiny voting machine (neuron) makes decisions. Now it is time to build a whole team out of them. You will see how groups of neurons draw increasingly clever boundaries between categories, how the network figures out which teammates to blame when it gets an answer wrong, and how to stop it from just memorizing the answers instead of actually learning.
Decision Boundaries 🔗
Imagine drawing lines on a map to separate two neighborhoods. One straight line can only split the map in two, which works for simple cases but fails when the neighborhoods are interleaved. A single neuron (our tiny voting machine from Chapter 1) can only draw one straight cut like that. But give it a few teammates, and the group can draw multiple lines that combine into curved, complex borders — separating even the trickiest layouts. These borders are called decision boundaries. Watch below how adding more neurons transforms a single straight cut into flexible curves that can separate any arrangement of data points.
# Visualize decision boundary of a trained model
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(model, X, y, resolution=100):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, resolution),
        np.linspace(y_min, y_max, resolution)
    )
    grid = np.c_[xx.ravel(), yy.ravel()]
    preds = model.predict(grid).reshape(xx.shape)
    # Plot contour of predictions, then the data points on top
    plt.contourf(xx, yy, preds, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.show()
Build It
This code builds a small two-layer network (untrained, with random weights) and runs it forward on the XOR problem — a classic pattern that a single straight line cannot separate — then maps out the decision boundary across a grid.
import numpy as np
# XOR: not linearly separable
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y = np.array([0, 1, 1, 0])
# 2-layer network: 2 → 4 → 1
np.random.seed(42)
W1 = np.random.randn(2, 4) * 0.5
b1 = np.zeros(4)
W2 = np.random.randn(4, 1) * 0.5
b2 = np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
# Forward pass
h = sigmoid(X @ W1 + b1) # hidden layer
out = sigmoid(h @ W2 + b2) # output
# Evaluate on a grid for the decision boundary
xx, yy = np.mgrid[0:1:0.01, 0:1:0.01]
grid = np.c_[xx.ravel(), yy.ravel()]
preds = sigmoid(sigmoid(grid @ W1 + b1) @ W2 + b2).reshape(xx.shape)
Under the Hood
Think of each neuron as drawing one straight line on the map, then the activation functions bend and combine those lines into curves. Each neuron contributes one linear boundary. Combined through nonlinear activations, they form arbitrarily complex regions. The grid evaluation is brute force — forward pass at every pixel. For a 100×100 grid with a network of width w: O(100² × w²) operations per frame.
Key Takeaway
- One neuron can only draw one straight line on the map — but a group of neurons working together can carve out any shape you need
- Some patterns (like XOR) are impossible to separate with a single straight cut — you need at least two neurons teaming up
- As the network learns (adjusts its weights during training), you can watch these boundary lines shift and reshape in real time
Neural Networks & Layers 🔗
Picture a relay race where each runner takes the baton, does one simple thing to it, and passes it on. That is essentially how a neural network works — a team of specialists arranged in a chain, where each person does one simple job and passes the result to the next. The first person looks at the raw input, picks out a few things, and hands a summary to the second person, who refines it further, and so on until the last person delivers the final answer. Each “person” in this chain is a layer — it takes in numbers, multiplies them by a set of weights (how much attention to pay to each input), adds a nudge (called a bias), and runs the result through an activation function (the one-way valve from Chapter 1). Stacking these layers one after another is called a forward pass: input flows in one end and a prediction comes out the other.
import numpy as np
class DenseLayer:
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_out, n_in) * 0.01
        self.b = np.zeros((n_out, 1))

    def forward(self, x):
        self.x = x
        z = self.W @ x + self.b
        return np.maximum(0, z)  # ReLU
Build It
This code creates a full neural network class with multiple layers, runs a forward pass (input in, prediction out), and counts the total number of adjustable settings (parameters) the network has.
import numpy as np
class NeuralNetwork:
    def __init__(self, layers):
        # He initialization: scale by sqrt(2/fan_in)
        self.params = []
        for i in range(len(layers) - 1):
            W = np.random.randn(layers[i], layers[i+1]) * np.sqrt(2 / layers[i])
            b = np.zeros(layers[i+1])
            self.params.append((W, b))

    def forward(self, X):
        self.activations = [X]  # cache for backprop
        for i, (W, b) in enumerate(self.params):
            z = X @ W + b
            X = np.maximum(0, z) if i < len(self.params) - 1 else 1 / (1 + np.exp(-z))
            self.activations.append(X)
        return X

net = NeuralNetwork([3, 4, 4, 2])  # 3→4→4→2
out = net.forward(np.random.randn(1, 3))
# Param count: (3×4 + 4) + (4×4 + 4) + (4×2 + 2) = 46
Under the Hood
Under the surface, a neural network is just a list of number grids (weight matrices) with a nudge value (bias) for each layer. A network = `list[(W, b)]`. He initialization scales weights by √(2/fan_in) to prevent activations from exploding or collapsing to zero. Parameter count: Σ(layer_i × layer_{i+1} + layer_{i+1}). Caching activations during forward pass is essential — backprop needs them to compute gradients.
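The explode-or-collapse claim about initialization can be demonstrated numerically. A sketch with an illustrative width of 256: push the same input through 10 ReLU layers under He initialization versus a naive tiny-weights initialization, and compare how big the activations stay.

```python
import numpy as np

rng = np.random.default_rng(0)
x_he = x_small = rng.normal(size=(1, 256))

for _ in range(10):
    W_he = rng.normal(size=(256, 256)) * np.sqrt(2 / 256)  # He init
    W_small = rng.normal(size=(256, 256)) * 0.01           # naive tiny init
    x_he = np.maximum(0, x_he @ W_he)
    x_small = np.maximum(0, x_small @ W_small)

print(x_he.std())     # stays order 1: signal preserved through all 10 layers
print(x_small.std())  # collapses toward 0: the signal has effectively died
```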
Key Takeaway
- A neural network is just a chain of specialists — each layer does a simple calculation and passes the result forward, like a relay race
- The starting values of the weights matter: if they are too big or too small, the signal either explodes or fades to nothing before it reaches the end (a technique called He initialization picks smart starting values)
- The entire “thinking” process is the forward pass — data flows through every layer in order, and out comes a prediction
Backpropagation 🔗
Imagine a teacher grading a group project and tracing blame backward: “the conclusion was weak because the analysis was wrong, which happened because the data was misread.” That is exactly how a neural network learns from its mistakes. After the network makes a prediction (the forward pass), it checks how wrong the answer was (the loss score from Chapter 1). Then it works backward through every layer, asking: “How much did you contribute to this mistake?” Each layer gets a blame score that tells it how to adjust its weights. This backward blame-tracing process is called backpropagation (or “backprop” for short). It is the engine that makes the training loop from Chapter 1 actually work for networks with many layers.
class Value:
    """Tiny autograd engine (inspired by micrograd)."""
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = set(_children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
Build It
This code runs a two-layer network forward to get a prediction, then traces blame backward through every layer to figure out how each weight should change.
import numpy as np
sigmoid = lambda z: 1 / (1 + np.exp(-z))
# 2-layer network: forward then backward
X = np.array([[0.5, -0.3]]) # (1, 2)
y = np.array([[0.8]]) # target
W1 = np.random.randn(2, 3) * 0.5
b1 = np.zeros(3)
W2 = np.random.randn(3, 1) * 0.5
b2 = np.zeros(1)
# Forward (cache z and a at each layer)
z1 = X @ W1 + b1; a1 = sigmoid(z1)
z2 = a1 @ W2 + b2; a2 = sigmoid(z2)
loss = np.mean((a2 - y) ** 2)
# Backward: three formulas per layer
dL_da2 = 2 * (a2 - y)
dL_dz2 = dL_da2 * a2 * (1 - a2) # through activation
dL_dW2 = a1.T @ dL_dz2 # outer product of activations and deltas, NOT elementwise
dL_db2 = dL_dz2.sum(axis=0)
dL_da1 = dL_dz2 @ W2.T # propagate backward
dL_dz1 = dL_da1 * a1 * (1 - a1)
dL_dW1 = X.T @ dL_dz1
dL_db1 = dL_dz1.sum(axis=0)
Under the Hood
The backward pass is essentially the teacher walking from the final answer back to the first step, handing out blame at each stop. Backprop costs roughly the same as the forward pass. Memory requirement: 2× inference because you must store all z and a from the forward pass. Common bugs: (1) transposing the wrong matrix in the weight gradient, (2) forgetting to cache activations, (3) not zeroing gradients between iterations. The weight gradient `dL/dW = a_prev.T @ dL/dz` is an outer product — this is the most commonly confused operation in backprop.
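The standard way to catch the bugs listed above is a finite-difference gradient check. This sketch reuses the same tiny 2-layer setup: compute the analytic gradient for W2 with the backprop formulas, then compare each entry against (loss(W2 + eps) - loss(W2 - eps)) / (2 * eps).

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
rng = np.random.default_rng(0)
X = np.array([[0.5, -0.3]])
y = np.array([[0.8]])
W1 = rng.normal(size=(2, 3)) * 0.5; b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1)) * 0.5; b2 = np.zeros(1)

def loss_fn(W2_):
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2_ + b2)
    return np.mean((a2 - y) ** 2)

# Analytic gradient: same formulas as the backward pass above
a1 = sigmoid(X @ W1 + b1)
a2 = sigmoid(a1 @ W2 + b2)
dL_dz2 = 2 * (a2 - y) * a2 * (1 - a2)
dL_dW2 = a1.T @ dL_dz2

# Numerical gradient: nudge one weight at a time
eps = 1e-6
num = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss_fn(Wp) - loss_fn(Wm)) / (2 * eps)

print(np.max(np.abs(num - dL_dW2)))  # should be near zero
```

If the two gradients disagree beyond round-off, one of the backprop formulas (usually a transposed matrix) is wrong.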
Key Takeaway
- Backpropagation is like a teacher grading a group project: it traces blame backward from the final answer through every layer to find out who made the mistake
- Each layer gets a “blame score” (called a gradient) that tells it exactly how to adjust its weights — the same hill-feeling process from Chapter 1, but now applied layer by layer
- Training uses roughly twice the memory of just making predictions, because the network has to remember its work from the forward pass so the backward pass can assign blame
Overfitting & Regularization 🔗
Think of a student who memorizes the answer key word-for-word instead of actually learning the subject — they ace every practice test but bomb the real exam because the questions are slightly different. Neural networks can do the same thing: instead of learning general patterns, they memorize the specific training examples, including the random noise and quirks. This is called overfitting. The warning sign is easy to spot: the network's error on its practice data (training loss) keeps going down, but its error on new, unseen data (validation loss) starts going up. To prevent this, we use tricks called regularization. One approach (called L2 regularization or weight decay) penalizes the network for having large weights, nudging it toward simpler solutions. Another (called dropout) randomly turns off some neurons during each training step, forcing the network to not rely too heavily on any single neuron.
import numpy as np
def l2_regularized_loss(y_true, y_pred, weights, lambda_=0.01):
    """MSE loss with L2 (weight decay) regularization."""
    mse = np.mean((y_true - y_pred) ** 2)
    l2_penalty = lambda_ * sum(np.sum(w**2) for w in weights)
    return mse + l2_penalty

def dropout(h, p=0.5, training=True):
    """Randomly zero out neurons during training."""
    if not training:
        return h
    mask = np.random.binomial(1, 1 - p, size=h.shape) / (1 - p)
    return h * mask
Build It
This code shows two ways to keep a network from memorizing: L2 regularization (penalizing large weights) and dropout (randomly silencing neurons during training).
import numpy as np
# Ridge regression: L2 regularization
X = np.random.randn(50, 5)
y = X @ np.array([1, 2, 0, -1, 0.5]) + np.random.randn(50) * 0.3
lam = 1.0 # regularization strength
# Closed-form: w = (X^T X + λI)^{-1} X^T y
# NOTE: do NOT penalize the bias column
XtX = X.T @ X
w_ridge = np.linalg.solve(XtX + lam * np.eye(5), X.T @ y)
# Dropout (training time)
def dropout(a, p=0.5):
    mask = np.random.binomial(1, 1 - p, size=a.shape)
    return a * mask / (1 - p)  # scale to maintain expected value
Under the Hood
Both techniques work by making the network keep things simple — small weights mean gentler, smoother predictions rather than wild, spiky ones. L2 regularization adds λ||w||² to the loss, pushing weights toward zero. The term +λI also makes X^TX invertible (useful when features > samples). Dropout during training creates a different ‘sub-network’ each step — at test time, it’s equivalent to an ensemble average. The 1/(1-p) scaling ensures activations have the same expected value during training and inference.
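The 1/(1-p) scaling claim can be checked empirically. A sketch: average inverted dropout over many random masks and confirm each activation's expected value comes back unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 3.0])
p = 0.5
n = 100_000

# n independent dropout masks, one per row
masks = rng.binomial(1, 1 - p, size=(n, a.size))
est = (a * masks / (1 - p)).mean(axis=0)  # inverted dropout, averaged

print(est)  # close to the original [1.0, 2.0, 3.0]
```

Without the 1/(1-p) factor the average would shrink to (1-p) times the original, and the network would see systematically smaller signals at test time than during training.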
Key Takeaway
- Overfitting is like memorizing the answer key — great on practice tests, terrible on the real exam. The warning sign is when validation loss starts climbing while training loss keeps falling.
- L2 regularization is like a “keep it simple” rule — it penalizes the network for using big weights, nudging it toward smoother, more general solutions
- Dropout randomly turns off neurons during training, like forcing a team to practice without their star player — everyone else has to step up, making the whole team more resilient
Residual Connections & Normalization 🔗
Imagine a long game of telephone: by the time a message passes through 50 people, it is completely garbled. Deep neural networks have the same problem — information and learning signals get weaker with every layer they pass through. Surprisingly, a 50-layer network can perform worse than a 20-layer one, even on data it has already seen (so it is not just memorizing poorly). This is called the degradation problem. The fix is beautifully simple: give each person in the telephone game a written copy of the original message alongside the whispered one. In network terms, you add a shortcut that lets the original input skip over a layer and get added directly to that layer’s output. This shortcut is called a residual connection (or skip connection). It creates a highway for information and learning signals to flow through, even in networks with hundreds of layers. Alongside residual connections, modern networks also use normalization — a step that keeps the numbers flowing through the network in a reasonable range, like an editor making sure each person’s notes are the same font size before passing them along.
import numpy as np
def residual_block(x, W1, b1, W2, b2):
    """A simple residual block: x + F(x)."""
    h = np.maximum(0, W1 @ x + b1)  # ReLU
    out = W2 @ h + b2
    return x + out  # skip connection

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization."""
    mu = np.mean(x, axis=-1, keepdims=True)
    sigma = np.std(x, axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta
Build It
This code builds a residual block (where the input gets added back to the output) and a normalization step, the two ingredients that let modern networks grow to hundreds of layers deep.
import numpy as np
# Residual block: the +x is the entire innovation
def residual_block(x, W1, W2):
z = np.maximum(0, x @ W1) # ReLU(x @ W1)
F_x = z @ W2 # second linear
return np.maximum(0, F_x + x) # F(x) + x, then ReLU
# RMSNorm (used in LLaMA/Mistral — faster than LayerNorm)
def rms_norm(x, eps=1e-6):
return x / np.sqrt(np.mean(x ** 2) + eps)
# Pre-LN transformer block (modern standard)
# x_out = x + Attention(rms_norm(x))
# x_out = x_out + FFN(rms_norm(x_out))
Under the Hood
The “written copy of the original message” means the learning signal always has a clean path through the network, even if some layers are struggling. The gradient through a residual block is dy/dx = dF/dx + I. Even if dF/dx vanishes, the identity term I keeps gradient magnitude at 1. Zero extra parameters. The key motivation: the degradation problem — a 56-layer plain network had higher training error than a 20-layer one. Pre-LN (Pre-Layer Normalization — normalizing before each sublayer instead of after) is the modern standard used in GPT-2+, Claude, and LLaMA. RMSNorm (Root Mean Square Normalization) skips mean-centering for ~15% speedup.
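The dy/dx = dF/dx + I claim can be checked with a quick finite-difference experiment: give F nearly dead weights (so dF/dx ≈ 0) and compare the gradient with and without the skip connection. Dimensions and weight scales below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1 = rng.normal(size=(d, d)) * 1e-4   # a nearly "dead" layer:
W2 = rng.normal(size=(d, d)) * 1e-4   # its Jacobian dF/dx is almost zero

def F(x):                             # the layer's transformation
    return np.maximum(0, x @ W1) @ W2

x = rng.normal(size=d)
eps = 1e-6
e0 = np.zeros(d); e0[0] = eps

# finite-difference gradient of sum(output) with respect to x[0]
g_plain = (F(x + e0).sum() - F(x).sum()) / eps                    # ≈ 0: signal dies
g_res = ((x + e0 + F(x + e0)).sum() - (x + F(x)).sum()) / eps     # ≈ 1: identity survives
print(round(g_plain, 3), round(g_res, 3))
```

Even with a useless layer, the residual path delivers a gradient of 1 — the written copy of the message arrives intact.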
Key Takeaway
- Residual connections are like giving each person in a telephone game a written copy of the original message — even if the whispered version gets garbled, the written one keeps the information intact
- Normalization keeps the numbers flowing through the network in a tidy range, like an editor standardizing everyone's handwriting so the next person can read it clearly
- These two tricks together are what let modern AI models stack hundreds of layers deep without the signal falling apart
Supplement Section 14: CNNs & RNNs
CNNs & RNNs 🔗
Imagine scanning a photo with a magnifying glass, inch by inch, looking for familiar patterns. That is essentially how one older AI design works. A CNN (Convolutional Neural Network) works like a magnifying glass sliding across an image — it looks at one small patch at a time, checking for patterns like edges, curves, or textures, then slides over to the next patch. By stacking layers, it builds up from simple edges to complex features like faces or cars. An RNN (Recurrent Neural Network) works like reading a book one word at a time while keeping a mental summary of everything you have read so far — each new word updates that running summary, so the network can handle sequences like sentences or time series. Both designs have been largely replaced by a newer architecture called the transformer (which you will meet in Chapter 4), but the ideas behind them still show up everywhere.
import numpy as np
def conv1d(x, kernel):
"""Simple 1D convolution (no padding)."""
k_len = len(kernel)
out_len = len(x) - k_len + 1
return np.array([np.dot(x[i:i+k_len], kernel) for i in range(out_len)])
def rnn_step(x_t, h_prev, W_h, W_x, b):
"""Single RNN step."""
return np.tanh(W_h @ h_prev + W_x @ x_t + b)
Build It
This code shows a CNN filter sliding across a signal to detect edges, and a single step of an RNN updating its running summary with new input.
import numpy as np
# 1D Convolution: a filter slides across input
signal = np.array([1, 0, 2, 3, 1, 0, 1])
kernel = np.array([1, 0, -1]) # edge detector
# NB: np.convolve flips the kernel (true convolution); ML libraries actually
# compute cross-correlation, i.e. np.correlate(signal, kernel, 'valid')
conv = np.convolve(signal, kernel, mode='valid')
# conv = [1, 3, -1, -3, 0]
# Simple RNN step
def rnn_step(x_t, h_prev, W_xh, W_hh, b):
h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b)
return h_t
Under the Hood
The key efficiency trick: the magnifying glass (CNN filter) uses the same lens everywhere it looks, and the reader (RNN) uses the same “how to update my summary” rule at every step. CNNs share weights across spatial positions (a 3×3 filter has only 9 params regardless of image size). RNNs share weights across time steps but suffer from vanishing gradients over long sequences — LSTMs (Long Short-Term Memory networks) added gates to control information flow. Transformers replaced both by processing all positions in parallel via attention.
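The vanishing-gradient problem can be seen directly: the Jacobian of one step h_t = tanh(W_hh h_{t-1}) is diag(1 − h_t²) · W_hh, and multiplying fifty of these together drives the gradient toward zero when the recurrent weights are small. A toy demonstration (the size and weight scale are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_hh = rng.normal(size=(d, d)) * 0.1    # small recurrent weights
h = rng.normal(size=d)
J = np.eye(d)                           # accumulated Jacobian dh_t/dh_0

for t in range(50):
    h = np.tanh(W_hh @ h)
    J = np.diag(1 - h ** 2) @ W_hh @ J  # chain rule, one step at a time

print(np.linalg.norm(J))                # tiny — early inputs barely influence step 50
```

This is the "early memories fade" effect from the takeaway below: by step 50, the gradient reaching the first input is vanishingly small, which is what LSTM gates — and later, attention — were designed to fix.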
Key Takeaway
- A CNN is like a magnifying glass that slides across an image, checking each small patch for patterns like edges and textures — it uses the same lens everywhere, so it needs very few settings to learn
- An RNN reads a sequence one step at a time while keeping a running summary of what came before — but for very long sequences, the early memories tend to fade (a problem called vanishing gradients)
- Transformers (coming in Chapter 4) replaced both by looking at all parts of the input at once, which is faster and handles long-range patterns better
Representation
Computers only understand numbers, not words. Before an AI can read a sentence, every word has to be translated into a list of numbers — and the sentence itself has to be chopped into bite-sized pieces. This chapter shows you how that translation and chopping happen.
~8 min · Embeddings 🔗
Imagine a map where cities are placed by similarity — Paris near Rome because both are European capitals, Tokyo near Seoul because both are Asian capitals. Embeddings do exactly this for words: they place each word on a “map” made of numbers, so that related words end up close together. The simplest approach — giving each word its own switch in a giant row of off-switches, then flipping just one on (called “one-hot encoding”) — wastes space and tells you nothing about which words are related. Embeddings fix this by giving every word a compact list of numbers (a “dense vector”) where similar words get similar numbers. Under the hood, it is just a table lookup — embedding = matrix[token_id] — but the table is learned through training (backprop, the “trace blame backward” process from Chapter 2).
import numpy as np
# Simple embedding lookup table
vocab_size, embed_dim = 1000, 64
E = np.random.randn(vocab_size, embed_dim) * 0.01
def embed(token_id):
return E[token_id]
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Build It
This code creates a word-to-numbers lookup table and a function that checks how similar two words are by comparing their number lists.
import numpy as np
# Embedding is just array indexing
vocab_size, d_model = 50000, 768
embedding_matrix = np.random.randn(vocab_size, d_model) * 0.02
token_id = 1234
embedding = embedding_matrix[token_id] # (768,) — that's it!
# Cosine similarity: are two words related?
def cosine_sim(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Under the Hood
Looking up an embedding is as cheap as copying a row from a spreadsheet — no heavy math at all. Embedding lookup is O(d) — just copying d numbers. No multiplication. GPT-2's embedding matrix: 50,257 × 768 × 4 bytes = ~148MB. Embeddings are learned through backprop: gradients only flow to the rows that were looked up (sparse updates). The 'king - man + woman ≈ queen' arithmetic works because the model learns consistent vector offsets for semantic relationships.
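The famous vector arithmetic can be demonstrated with tiny hand-crafted vectors. These three made-up dimensions (royalty, male, female) stand in for the hundreds a real model learns on its own:

```python
import numpy as np

# Hand-crafted toy embeddings — illustrative, not learned
# dimensions: [royalty, male, female]
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max(vecs, key=lambda w: cos(vecs[w], target))
print(best)  # → queen
```

Subtracting "man" removes the male offset, adding "woman" adds the female one, and the nearest remaining vector is "queen" — the same consistent-offset effect real models exhibit.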
Key Takeaway
- An embedding is like a GPS coordinate for a word — it turns a word into a list of numbers that capture its meaning.
- Words with similar meanings land near each other on the “map,” just like Paris and Rome sit close on a real map.
- The whole process is just looking up a row in a table, and the table improves as the model trains.
Tokenization 🔗
Like a child learning to read by sounding out syllables — “un-happi-ness” — AI models break text into small, manageable pieces before they can process it. These pieces are called “tokens.” Splitting letter by letter makes sentences painfully long. Splitting by whole words breaks when the model meets a word it has never seen. A technique called BPE (Byte Pair Encoding — a method that builds a vocabulary by repeatedly gluing together the most common neighboring pieces) finds the sweet spot: it starts with individual letters, then keeps merging the pairs that appear together most often. After about 50,000 merges you get a vocabulary that can handle any text efficiently.
def simple_bpe_step(corpus, num_merges=10):
    """Simplified BPE: repeatedly merge the most frequent pair."""
    # Start with character-level tokens plus an end-of-word marker
    tokens = [list(word) + ['</w>'] for word in corpus]
    for _ in range(num_merges):
        # Count all adjacent pairs across every word
        pairs = {}
        for word_tokens in tokens:
            for i in range(len(word_tokens) - 1):
                pair = (word_tokens[i], word_tokens[i+1])
                pairs[pair] = pairs.get(pair, 0) + 1
        if not pairs:
            break
        # Merge every occurrence of the most frequent pair
        best = max(pairs, key=pairs.get)
        merged = []
        for word_tokens in tokens:
            new_word, i = [], 0
            while i < len(word_tokens):
                if i < len(word_tokens) - 1 and (word_tokens[i], word_tokens[i+1]) == best:
                    new_word.append(word_tokens[i] + word_tokens[i+1])
                    i += 2
                else:
                    new_word.append(word_tokens[i])
                    i += 1
            merged.append(new_word)
        tokens = merged
    return tokens
Build It
This code starts with individual letters and repeatedly merges the most common pair, showing you each merge step — just like BPE builds its vocabulary.
# Simplified BPE implementation
def bpe(text, num_merges=10):
tokens = list(text) # start with characters
for i in range(num_merges):
# Count all adjacent pairs
pairs = {}
for j in range(len(tokens) - 1):
pair = (tokens[j], tokens[j+1])
pairs[pair] = pairs.get(pair, 0) + 1
if not pairs:
break
# Merge most frequent pair
best = max(pairs, key=pairs.get)
new_tokens = []
j = 0
while j < len(tokens):
if j < len(tokens)-1 and (tokens[j], tokens[j+1]) == best:
new_tokens.append(tokens[j] + tokens[j+1])
j += 2
else:
new_tokens.append(tokens[j])
j += 1
tokens = new_tokens
print(f"Merge {i+1}: '{best[0]}'+'{best[1]}' → '{best[0]+best[1]}'")
return tokens
Under the Hood
Think of BPE as a zip file for language — it finds repeating patterns and squishes them together to save space. BPE is greedy compression. GPT-2 uses ~50K merges starting from bytes (byte-level BPE). Vocab size tradeoffs: larger vocab = more embedding params but fewer tokens per sequence. English averages ~1.3 tokens/word; CJK languages use 2-3× more tokens per word — this means shorter effective context windows for non-English text.
Key Takeaway
- BPE builds a dictionary by gluing together letter-pairs that often appear side by side — like noticing “th” shows up everywhere in English and making it one piece.
- The size of this dictionary is a trade-off: a bigger dictionary means the model needs more memory, and some languages get shortchanged with fewer entries.
- This chopping-up step happens before the AI ever sees the text — it is the very first stage of the pipeline.
Attention & Transformers
This is where the magic happens. You will learn how an AI decides which words in a sentence matter most to each other, how those decisions are stacked into a powerful assembly-line architecture, and how that architecture powers the chatbots and writing tools you use every day.
~12 min · Attention & Multi-Head 🔗
You are at a crowded party and someone across the room says your name — your brain instantly zeros in on that voice and tunes out the noise. AI attention works the same way: each word “listens” to every other word in the sentence and decides which ones matter most right now. To do this, every word creates three things: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what information do I carry?”). The model figures out how much two words should pay attention to each other by comparing their Query and Key (using a dot product, which is just a way of measuring similarity). Then it uses softmax (the “pick-a-winner” function from Chapter 1) to turn those raw scores into percentages that add up to 100%. This is the single most important equation in modern AI.
import numpy as np
def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores) # along the last axis
    return weights @ V
def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
"""Multi-head attention (simplified)."""
d = x.shape[-1]
head_dim = d // n_heads
heads = []
for h in range(n_heads):
Q = x @ W_q[h]
K = x @ W_k[h]
V = x @ W_v[h]
heads.append(scaled_dot_product_attention(Q, K, V))
return np.concatenate(heads, axis=-1) @ W_o
Build It
This code computes attention scores between words and then uses multiple “attention heads” so the model can focus on different types of relationships at the same time.
import numpy as np
def attention(Q, K, V):
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k) # (seq, seq)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # softmax
return weights @ V # (seq, d_v)
# Multi-head: split, attend, concatenate
def multi_head(X, n_heads, Wq, Wk, Wv, Wo):
Q, K, V = X @ Wq, X @ Wk, X @ Wv
d_k = Q.shape[-1] // n_heads
heads = []
for i in range(n_heads):
qi = Q[:, i*d_k:(i+1)*d_k]
ki = K[:, i*d_k:(i+1)*d_k]
vi = V[:, i*d_k:(i+1)*d_k]
heads.append(attention(qi, ki, vi))
return np.concatenate(heads, axis=-1) @ Wo
Under the Hood
The big cost of attention is that every word has to check in with every other word, so doubling the sentence length quadruples the work. Q@K^T creates a (seq_len × seq_len) attention matrix — this is why attention is O(seq² × d). The √d_k scaling prevents dot products from growing too large (when Q,K entries are iid with variance 1, the dot product has variance d_k — dividing by √d_k keeps variance at 1, preventing softmax saturation). Multi-head attention lets the model attend to different types of relationships simultaneously.
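The √d_k variance argument can be checked empirically: dot products of random d_k-dimensional vectors have variance ≈ d_k, and dividing by √d_k brings it back to ≈ 1. The sample sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10_000, d_k))    # 10,000 random query vectors
k = rng.normal(size=(10_000, d_k))    # 10,000 random key vectors

dots = (q * k).sum(axis=1)            # raw dot products
print(dots.var())                     # ≈ 64 — grows linearly with d_k
print((dots / np.sqrt(d_k)).var())    # ≈ 1.0 after scaling
```

Without the scaling, scores of magnitude ~√64 = 8 would push softmax into its saturated region, where gradients are nearly zero — so the division is what keeps attention trainable at large d_k.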
Key Takeaway
- Attention is the AI asking “which other words should I pay attention to right now?” — like your brain picking out your name in a noisy room.
- Because every word checks every other word, the work grows rapidly with longer text — this is why chatbots have a limit on how much text they can handle at once.
The Transformer Architecture 🔗
Picture a factory assembly line where each station has two jobs: first, the workers discuss which parts of the project matter most (that is attention); then each worker refines their own piece independently (that is the feed-forward network, or FFN). After each job, every worker keeps a photocopy of what they had before so nothing gets lost (that is the residual connection — the “telephone game with written copies” from Chapter 2). A transformer is just a stack of these identical stations, one after another. The FFN holds roughly two-thirds of all the model's learned information — think of it as the factory's filing cabinet of knowledge. Modern transformers also tidy up the numbers before each step (called “normalization”) to keep things stable.
class TransformerBlock:
def __init__(self, d_model, n_heads):
self.attention = MultiHeadAttention(d_model, n_heads)
self.ffn = FeedForward(d_model, d_model * 4)
self.ln1 = LayerNorm(d_model)
self.ln2 = LayerNorm(d_model)
def forward(self, x):
# Self-attention with residual + norm (Post-LN, the original Transformer ordering)
h = self.ln1(x + self.attention(x))
# Feed-forward with residual + norm
return self.ln2(h + self.ffn(h))
Build It
This code builds one station of the assembly line: it normalizes, runs attention, adds the residual shortcut, then does the same for the feed-forward step.
import numpy as np
def transformer_block(x, attn_fn, W1, W2, b1, b2):
# Pre-LN: normalize, then sublayer, then add
normed = rms_norm(x)
x = x + attn_fn(normed) # attention + residual
normed = rms_norm(x)
# FFN: expand to 4×d, activate, project back
h = np.maximum(0, normed @ W1 + b1) # (seq, 4*d)
x = x + h @ W2 + b2 # (seq, d) + residual
return x
# Param count per block:
# Attention: 4 × d² (Wq, Wk, Wv, Wo)
# FFN: 2 × d × 4d = 8d²
# Total per block: ~12d²
# LLaMA-7B: d=4096, N=32 → ~6.7B params
Under the Hood
The feed-forward network is where the model stores most of what it "knows" — facts, grammar rules, and patterns it picked up during training. The FFN expands to 4× d_model, applies an activation, and projects back. SwiGLU (used in LLaMA/Mistral) uses 3 matrices: (xW₁ · swish(xW₃)) @ W₂. Mixture of Experts (MoE) replaces one FFN with N expert FFNs + a router — more parameters without proportional compute (Mixtral has 46.7B params but only uses 12.9B per forward pass).
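The SwiGLU variant mentioned above can be sketched directly from the formula (xW₁ · swish(xW₃)) @ W₂. All shapes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def swish(x):                        # swish / SiLU activation
    return x / (1 + np.exp(-x))

def swiglu_ffn(x, W1, W3, W2):
    # gated FFN: one branch carries content, the other (swish) gates it
    return (x @ W1 * swish(x @ W3)) @ W2

d, d_ff = 8, 32                      # SwiGLU models often use d_ff ≈ (8/3)·d
W1 = rng.normal(size=(d, d_ff))
W3 = rng.normal(size=(d, d_ff))
W2 = rng.normal(size=(d_ff, d))

out = swiglu_ffn(rng.normal(size=(4, d)), W1, W3, W2)
print(out.shape)                     # (4, 8) — same shape in and out, like a standard FFN
```

The gate lets the network modulate information flow per dimension, which in practice trains slightly better than a plain ReLU FFN at equal parameter count.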
Key Takeaway
- A transformer is a stack of identical stations, each combining a “discussion round” (attention) with a “solo refinement step” (FFN), plus photocopied shortcuts (residuals) and tidying-up (normalization).
- The FFN is the filing cabinet — it holds the bulk of what the model has memorized, from facts to grammar.
- A clever trick called Mixture of Experts (MoE) lets a model have a huge filing cabinet but only open a few drawers at a time, keeping it fast.
How LLMs Work 🔗
Think of an incredibly well-read autocomplete — the kind that has read billions of web pages, books, and articles. A Large Language Model (LLM) does one thing: predict the next word. Your text goes in, gets chopped into tokens (Section 16), translated into number-lists called embeddings (Section 15), and then passed through the transformer assembly line (Section 18). At the end, the model looks at every word in its dictionary and assigns each one a probability — “how likely is this word to come next?” The winner (or a randomly chosen high-scorer) becomes the next word, and the whole process repeats, one word at a time.
class SimpleLM:
def __init__(self, vocab_size, d_model, n_layers, n_heads):
self.embed = EmbeddingTable(vocab_size, d_model)
self.blocks = [TransformerBlock(d_model, n_heads)
for _ in range(n_layers)]
self.ln_f = LayerNorm(d_model)
self.head = Linear(d_model, vocab_size)
def forward(self, token_ids):
x = self.embed(token_ids)
for block in self.blocks:
x = block(x)
x = self.ln_f(x)
logits = self.head(x) # (seq_len, vocab_size)
return logits
Build It
This code takes the transformer's final output for a sentence, scores every word in the dictionary, applies a “temperature” dial (higher = more creative, lower = more predictable), and picks the next word.
import numpy as np
# The LM head: project hidden state to vocabulary
d_model, vocab_size = 768, 50000
h_last = np.random.randn(d_model) # last hidden state
W_head = np.random.randn(vocab_size, d_model) # often tied to the embedding matrix
logits = W_head @ h_last # (50000,)
# Temperature scaling
T = 0.8
probs = np.exp((logits - logits.max()) / T)
probs /= probs.sum() # softmax with temperature
next_token = np.random.choice(vocab_size, p=probs)
Under the Hood
Running an LLM is like passing a message through every station on the assembly line, then checking every word in the dictionary at the end — the longer the message and the bigger the dictionary, the more work it takes. Full forward pass cost: embedding O(seq×d), attention O(N×seq²×d), FFN O(N×seq×d²), LM head O(vocab×d). Weight tying: the LM head often reuses the embedding matrix itself — the same table that encoded the input scores the vocabulary at the output — saving vocab×d parameters. During training, every position predicts the next token simultaneously, so one sequence yields seq_len training examples.
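Weight tying is easy to show with a toy stand-in for the transformer. Here the "model" is just a mean of embeddings — purely illustrative — but the tied head is the real pattern: the same table E both encodes the input and scores the vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 100, 16
E = rng.normal(size=(vocab, d)) * 0.1      # embedding table — the only weights here

def tiny_lm(token_ids):
    h = E[token_ids].mean(axis=0)          # toy stand-in for the transformer stack
    return E @ h                           # tied LM head: logits from the same table

tokens = [3, 17, 42]                       # a 3-token "prompt"
for _ in range(5):
    tokens.append(int(np.argmax(tiny_lm(tokens))))   # greedy next-token decoding
print(len(tokens))                         # 8 — the prompt plus 5 generated tokens
```

The loop is the whole generation process in miniature: score the vocabulary, pick a token, append it, repeat.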
Key Takeaway
- An LLM is a next-word predictor: it looks at everything written so far and guesses what comes next, like the world's most well-read autocomplete.
- The final step (called the “LM head”) converts the transformer's internal numbers into a probability for every word in the dictionary.
- A common memory-saving trick: the same table used to convert words into numbers at the start is reused in reverse at the end (called “weight tying”).
Training & Using LLMs
You know how the engine is built — now it is time to drive the car. This chapter shows how an AI learns from massive amounts of text, how it writes responses one word at a time, how you can guide it with clever instructions, and how you can give it access to outside knowledge and tools. By the end, you will build a tiny working language model from scratch.
~15 min · Training LLMs 🔗
Raising a child who learns to speak happens in stages — and training an AI language model follows the same pattern. First, the child listens to millions of conversations and picks up the patterns of language — this is called pre-training, where the model reads vast amounts of text and learns to predict the next word. Then the child learns manners and social rules — this is alignment, where techniques like RLHF (Reinforcement Learning from Human Feedback, meaning humans rate the AI's answers so it learns which responses are helpful and safe) and DPO (Direct Preference Optimization, a simpler way to teach preferences) fine-tune the model's behavior. Finally, the child might specialize for a particular job — this is fine-tuning, and a technique called LoRA (Low-Rank Adaptation) makes this practical by adjusting only a small fraction of the model's settings instead of rewriting everything from scratch.
# Simplified training stages
# Stage 1: Pre-training (next token prediction)
for batch in pretrain_dataloader:
logits = model(batch.input_ids)
loss = cross_entropy(logits, batch.target_ids)
loss.backward()
optimizer.step()
# Stage 2: Supervised Fine-Tuning (SFT)
for batch in instruction_dataloader:
logits = model(batch.prompt + batch.response)
loss = cross_entropy(logits, batch.response) # only on response
loss.backward()
optimizer.step()
# Stage 3: RLHF (simplified)
# Train reward model, then optimize policy with PPO
Build It
This code calculates how wrong the model's guess was and then nudges its settings in the right direction — the same show-observe-correct-repeat cycle from earlier chapters, applied to language.
import numpy as np
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()
# Cross-entropy gradient (elegant simplification)
logits = np.random.randn(50000)          # raw scores over the vocabulary
probs = softmax(logits)                  # model's predictions
target = 42                              # true next token ID
loss = -np.log(probs[target])            # cross-entropy loss
grad = probs.copy()
grad[target] -= 1                        # gradient = softmax - one_hot
# AdamW update (decoupled weight decay), shown for one parameter vector
beta1, beta2, lr, eps, weight_decay, t = 0.9, 0.999, 1e-3, 1e-8, 0.01, 1
w = np.random.randn(50000)               # illustrative parameters
m, v = np.zeros_like(w), np.zeros_like(w)
m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (squared gradients)
m_hat = m / (1 - beta1 ** t)             # bias correction
v_hat = v / (1 - beta2 ** t)
w = w * (1 - lr * weight_decay)          # decoupled decay
w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
Under the Hood
Here is what happens behind the scenes when the model learns from its mistakes. Cross-entropy gradient is simply softmax - one_hot. AdamW (Adam with Weight Decay, a popular optimizer) stores 2× the model parameters (m and v states): a 7B model needs ~56GB just for optimizer state. RLHF trains a reward model on human preferences, then uses PPO (Proximal Policy Optimization, a reinforcement learning algorithm) to optimize. DPO simplifies this by directly optimizing on preference pairs without a reward model. LoRA: freeze base weights, add W + A@B where A and B are small rank-r matrices.
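The LoRA formula W + A@B from above, in NumPy. The dimension d and rank r are illustrative; B starts at zero so the model is exactly unchanged at the start of fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8
W = rng.normal(size=(d, d))            # frozen base weight — never updated
A = rng.normal(size=(d, r)) * 0.01     # trainable, small random init
B = np.zeros((r, d))                   # trainable, zero init → A@B = 0 at start

x = rng.normal(size=d)
print(np.allclose(x @ (W + A @ B), x @ W))   # True: identical output before training

trainable = A.size + B.size            # 2 × d × r = 8,192
full = W.size                          # d² = 262,144
print(f"{trainable / full:.1%} of the full matrix")   # 3.1%
```

Training touches only A and B, so optimizer state (the 2× overhead of AdamW) also shrinks by the same factor — which is what makes laptop-scale fine-tuning possible.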
Key Takeaway
- An AI learns language the way a child does — by hearing billions of sentences and getting better at guessing the next word.
- After learning language, it learns manners — alignment (RLHF/DPO) teaches it to be helpful and safe, like a parent correcting behavior.
- You do not have to retrain the whole brain to teach it a new skill — LoRA lets you fine-tune just a small piece, making customization practical even on a laptop.
Inference & Decoding 🔗
An AI writes a story the same way you might — one word at a time, where each new word depends on everything written so far. For each word, the model looks at all the words before it and picks the next one. How it picks matters: always choosing the most obvious word (called greedy decoding) produces dull, predictable text. Adding a bit of randomness (called temperature) makes it more creative. Filters like top-k (only consider the k most likely words) and top-p (only consider words whose combined chances add up to p) keep it from going off the rails. There is also a crucial speed trick called the KV cache (Key-Value cache) — instead of re-reading the entire story from the beginning every time it writes a new word, the model remembers what it already processed, like using a bookmark instead of starting over from page one.
import numpy as np
def sample_with_temperature(logits, temperature=1.0):
    """Sample from logits with temperature scaling."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()   # subtract max for numerical stability
    probs = np.exp(scaled) / np.sum(np.exp(scaled))
    return np.random.choice(len(probs), p=probs)
def top_k_sampling(logits, k=10):
    """Keep only the top-k logits; mask the rest to -inf (probability 0)."""
    indices = np.argsort(logits)[-k:]
    mask = np.full_like(logits, -np.inf)
    mask[indices] = logits[indices]
    probs = np.exp(mask - mask.max()) / np.sum(np.exp(mask - mask.max()))
    return np.random.choice(len(probs), p=probs)
Build It
This code picks the next word by adjusting how adventurous the model is (temperature) and filtering out unlikely choices (top-k and top-p), then rolling the dice among the remaining options.
import numpy as np
def sample_token(logits, temperature=1.0, top_k=0, top_p=0.9):
logits = logits / temperature
if top_k > 0:
top_k_idx = np.argsort(logits)[-top_k:]
mask = np.full_like(logits, -np.inf)
mask[top_k_idx] = logits[top_k_idx]
logits = mask
probs = np.exp(logits - logits.max())
probs /= probs.sum()
    if top_p < 1.0:
        sorted_idx = np.argsort(probs)[::-1]
        cumsum = np.cumsum(probs[sorted_idx])
        # Drop tokens after the cumulative mass passes top_p, but always
        # keep the token that crosses the threshold (shift the mask by one)
        remove = np.roll(cumsum > top_p, 1)
        remove[0] = False
        probs[sorted_idx[remove]] = 0
        probs /= probs.sum() # renormalize!
return np.random.choice(len(probs), p=probs)
Under the Hood
The biggest performance trick in text generation is avoiding redundant work. KV cache: store K,V for each layer (memory: batch × n_layers × seq_len × d_model). For each new token, compute only its Q and attend to all cached K,V. This reduces per-token computation from O(seq²) to O(seq). Paged attention (vLLM) manages KV cache like virtual memory pages to reduce waste. Speculative decoding: a small draft model proposes N tokens, the large model verifies in parallel — up to Nx speedup with identical output.
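The KV-cache equivalence is easy to verify: attending with only the new token's query over cached keys and values gives exactly the same output as recomputing attention for the whole sequence. Dimensions are illustrative, and the Q/K/V projections are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def attend(Q, K, V):
    s = Q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

X = rng.normal(size=(6, d))                     # 6 tokens (projections pretended away)
full = attend(X[-1:], X, X)                     # recompute everything for the last token

K_cache, V_cache = X[:5].copy(), X[:5].copy()   # stored from the previous 5 steps
K_cache = np.vstack([K_cache, X[5:]])           # append only the new token's K...
V_cache = np.vstack([V_cache, X[5:]])           # ...and V
cached = attend(X[-1:], K_cache, V_cache)

print(np.allclose(full, cached))                # True — same output, O(seq) work per token
```

The cache trades memory for compute: nothing about the math changes, we simply stop recomputing K and V for tokens we have already seen.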
Key Takeaway
- The KV cache is like using a bookmark — instead of re-reading the whole book for every new word, the model remembers what it already processed, making generation dramatically faster.
- Temperature is a creativity dial: turn it up for surprising, imaginative text; turn it down for safe, predictable answers. Top-k and top-p act as quality filters that remove nonsensical choices.
- Speculative decoding is like having a fast assistant draft several words ahead and the expert just checks them — much faster, with identical results.
Context Windows & Prompting 🔗
Imagine reading a book through a small window that only shows a few pages at a time — that is an AI's context window, its short-term memory. The model can only "see" a limited amount of text at once, and everything outside that window is invisible to it. To keep track of word order, the model uses position encoding, a way of stamping each word with its place in the sentence (like page numbers in a book). Prompting — the art of writing good instructions for AI — works because the instructions, any examples you provide, and your actual question all get fed into this same window as regular text, and the attention mechanism (hearing your name at a party) treats them all equally.
# Prompting strategies
zero_shot = "Translate to French: Hello"
few_shot = """Translate to French:
Hello -> Bonjour
Goodbye -> Au revoir
Thank you -> Merci
Good morning ->"""
chain_of_thought = """Q: If a store has 5 apples and
sells 2, how many remain?
Let's think step by step:
1. Start with 5 apples
2. Sell 2 apples
3. 5 - 2 = 3 apples remain
A: 3"""
Build It
This code creates the "page numbers" that tell the model where each word sits in the sentence, using a wave pattern that gives every position a unique fingerprint.
import numpy as np
# Sinusoidal positional encoding (original Transformer)
def positional_encoding(seq_len, d_model):
pos = np.arange(seq_len)[:, np.newaxis]
dim = np.arange(d_model)[np.newaxis, :]
angle = pos / 10000 ** (2 * (dim // 2) / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angle[:, 0::2])
pe[:, 1::2] = np.cos(angle[:, 1::2])
return pe
# Usage: add to embeddings
# X = token_embeddings + positional_encoding(seq_len, d_model)
Under the Hood
The way models keep track of word order has improved significantly over time. Position encoding evolution: sinusoidal (fixed) → learned absolute (GPT-2) → RoPE (Rotary Position Embedding, used in modern models like LLaMA). RoPE rotates the Query and Key vectors by position-dependent angles, making attention scores depend on relative position (how far apart two words are, not their absolute positions). This enables context extension via NTK-aware scaling (a mathematical trick to stretch the model's window to longer texts than it was trained on). Prompting is not a special mechanism — few-shot examples work because attention sees the pattern and continues it.
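RoPE's key property — attention scores that depend only on relative distance — can be verified numerically. Below is a minimal rotate-half sketch (one common layout; real implementations differ in how dimensions are paired):

```python
import numpy as np

def rope(x, pos, base=10000):
    """Rotate pairs of dimensions by position-dependent angles (rotate-half layout)."""
    half = len(x) // 2
    theta = pos * base ** (-np.arange(half) / half)   # one angle per 2-D plane
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

s1 = rope(q, 5) @ rope(k, 3)        # positions 5 and 3 — a gap of 2
s2 = rope(q, 105) @ rope(k, 103)    # positions 105 and 103 — the same gap
print(np.allclose(s1, s2))          # True: only the gap matters, not absolute position
```

Because each 2-D rotation by angle mθ composed with one by nθ depends only on (m − n)θ, the attention score between two words is a function of how far apart they are — exactly the property that makes context-window extension tricks possible.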
Key Takeaway
- The context window is the AI's short-term memory — it can only see a fixed amount of text at once, like reading through a window. Modern position-tracking methods like RoPE (Rotary Position Embedding) let the model understand how far apart words are, and can be stretched to widen that window.
- Prompting is not magic — your instructions, examples, and questions are all just words in the window, and the model pays attention to all of them equally when crafting its response.
RAG, Tool Use & Agents 🔗
Picture an expert with amnesia — brilliant, but they cannot remember anything beyond what is in front of them right now. That is an AI without help: its training data is frozen in the past, and its short-term memory (context window) is limited. RAG (Retrieval-Augmented Generation, which means "fetch relevant reference pages before answering") fixes this by looking up the most useful documents and slipping them into the prompt so the model can read them. Tool use goes further — it lets the model call outside services like a calculator, a search engine, or a database, the way you might pick up a phone to look something up. Agents combine all of this into a loop: the model thinks about what to do, takes an action, observes the result, and repeats until the task is done.
# Simplified RAG pipeline
def rag_answer(query, documents, model):
# 1. Embed the query
query_emb = model.embed(query)
# 2. Retrieve top-k relevant documents
scores = [cosine_similarity(query_emb, doc.embedding)
for doc in documents]
top_docs = sorted(zip(scores, documents), key=lambda sd: sd[0], reverse=True)[:3]
# 3. Augment the prompt with retrieved context
context = "\n".join(doc.text for _, doc in top_docs)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
# 4. Generate answer
return model.generate(prompt)
Build It
This code shows two patterns: RAG finds the most relevant documents and pastes them into the prompt before asking the model, and an agent loop that keeps thinking and acting until the task is finished.
import numpy as np
# RAG: retrieve relevant context
def rag(query, documents, embeddings, top_k=3):
q_emb = embed(query)
scores = embeddings @ q_emb # dot product (cosine similarity if rows are unit-normalized)
top_idx = np.argsort(scores)[-top_k:]
context = "\n".join(documents[i] for i in top_idx)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
return llm(prompt)
# Agent loop
def agent(task):
history = [{"role": "user", "content": task}]
while True:
response = llm(history)
if response.tool_call:
result = execute_tool(response.tool_call)
history.append({"role": "tool", "content": result})
else:
return response.text
Under the Hood
The hardest part of RAG is deciding how to break documents into pieces the model can digest. RAG chunking strategies: fixed-size (simple), semantic (split at paragraph boundaries), recursive (split large chunks further). Embedding models (like BERT — Bidirectional Encoder Representations from Transformers) produce vectors for similarity search. For scale, use approximate nearest neighbor search (HNSW — Hierarchical Navigable Small World, a fast lookup structure) — exact search is O(n). Agent challenges: errors accumulate over long trajectories, and the model must plan with imperfect information.
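The simplest of the chunking strategies above — fixed-size with overlap — is only a few lines. The sizes here are arbitrary; real systems tune them to the embedding model's input limit:

```python
def chunk_fixed(text, size=200, overlap=40):
    """Fixed-size chunking with overlap, so a sentence cut at one
    chunk boundary still appears whole in the neighboring chunk."""
    chunks, start = [], 0
    step = size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks

doc = "word " * 200          # a 1,000-character toy document
chunks = chunk_fixed(doc)
print(len(chunks))           # 7 chunks of ≤ 200 chars, each overlapping the next by 40
```

The overlap wastes a little index space but prevents the classic failure where the one sentence that answers the question is split across two chunks and retrieved by neither.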
Key Takeaway
- RAG is like giving the amnesia expert a few relevant reference pages before they answer your question — it fetches the right information so the AI does not have to rely on memory alone.
- Tool use gives the AI hands — it can search the web, run calculations, or query a database, just like you would reach for a calculator or phone.
- An agent is an AI that works independently: it thinks about what to do, does it, checks the result, and repeats — like a diligent assistant who keeps going until the job is done.
Evaluation & Practical Considerations 🔗
Imagine giving someone a multiple-choice test where every question has a different number of options. Perplexity (a measure of how "surprised" the model is) works just like that: a perplexity of 10 means the model is choosing among roughly 10 equally likely options for each word. Lower is better — a well-trained model is rarely surprised. To make models cheaper to run, there is a trick called quantization (rounding detailed measurements to whole numbers) — think of it like rounding 3.14159 to just 3. The answer is close enough for practical use, but the math is much faster. A 7-billion-parameter model that normally needs 14 gigabytes of memory can be squeezed into about 3.5 gigabytes using 4-bit quantization, with barely any drop in quality.
import numpy as np
def perplexity(log_probs):
    """Compute perplexity from log probabilities."""
    avg_log_prob = np.mean(log_probs)
    return np.exp(-avg_log_prob)
def accuracy(predictions, labels):
    """Simple classification accuracy."""
    return np.mean(np.array(predictions) == np.array(labels))
# Practical model selection criteria:
# - Task: classification, generation, reasoning?
# - Latency: real-time vs. batch?
# - Cost: tokens per dollar?
# - Privacy: can data leave your infrastructure?
Build It
This code measures how confused the model is (perplexity) and how accurately it classifies things (precision, recall, and F1 score) — the basic report card for any AI.
import numpy as np
# Perplexity: how surprised is the model?
log_probs = np.array([-2.3, -1.1, -0.5, -3.2, -1.8])
perplexity = np.exp(-np.mean(log_probs))
# perplexity ≈ 6.0 → choosing from ~6 equally likely options
# Precision, Recall, F1
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    if precision + recall == 0:
        return 0
    return 2 * precision * recall / (precision + recall)
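As a quick sanity check on the F1 arithmetic, here is the same formula run on hypothetical counts (the 8/2/4 numbers are made up for illustration):

```python
tp, fp, fn = 8, 2, 4                  # hypothetical confusion-matrix counts
precision = tp / (tp + fp)            # 8/10 = 0.80: of flagged items, how many were right
recall = tp / (tp + fn)               # 8/12 ≈ 0.67: of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.80 recall=0.67 f1=0.73
```

F1 sits between precision and recall but punishes imbalance: a model with 0.99 precision and 0.01 recall scores near zero, not near 0.5.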
Under the Hood
Rounding works so well because most of the model's internal numbers cluster near zero, so you lose very little by storing them with fewer bits. More precisely, weight distributions are approximately Gaussian, which is why a handful of bits per weight suffice. INT4 (4-bit integers) with outlier handling, via popular methods like GPTQ and AWQ, typically loses less than 1% accuracy. Memory math: 7 billion parameters × 2 bytes (fp16) = 14GB; 7B × 0.5 bytes (4-bit) ≈ 3.5GB + overhead. Common benchmarks: MMLU (Massive Multitask Language Understanding — tests knowledge), HumanEval (tests coding ability), GPQA (tests graduate-level reasoning) — but benchmarks can be gamed, so always evaluate on YOUR task.
Key Takeaway
- Perplexity is the AI's confusion score — like counting how many equally likely choices it is torn between for each word. Lower means the model understands the text better.
- Quantization is like rounding detailed measurements to whole numbers — 4-bit quantization shrinks a model to one quarter of its original memory with barely any drop in quality, making it practical to run on everyday hardware.
- Standardized tests can be gamed — always test the model on your own real-world task, because a high benchmark score does not guarantee it works well for your specific need.
Capstone — Build a Tiny LM 🔗
After learning what every part of an engine does, it is time to build a small working engine and watch it run. This final section brings together every concept you have learned into one complete, tiny language model that trains and generates text right in your browser. It uses word-to-number mappings (embeddings, Section 15), word-order stamps (positional encoding, Section 22), the "hearing your name at a party" mechanism (attention, Section 17), the factory processing steps with skip connections (FFN with residuals, Section 18), the scoring system (cross-entropy loss, Section 8), and the blindfolded hill-walking optimizer (gradient descent, Section 5). The model has only about 10,000-15,000 adjustable settings and can learn from Shakespeare in minutes.
import numpy as np
# TransformerBlock, LayerNorm, and sample_with_temperature are assumed
# from earlier sections
class TinyLM:
    """A complete tiny language model."""
    def __init__(self, vocab_size=256, d_model=64,
                 n_heads=4, n_layers=2, max_len=128):
        self.tok_emb = np.random.randn(vocab_size, d_model) * 0.02
        self.pos_emb = np.random.randn(max_len, d_model) * 0.02
        self.blocks = [TransformerBlock(d_model, n_heads)
                       for _ in range(n_layers)]
        self.ln_f = LayerNorm(d_model)
        self.head = np.random.randn(d_model, vocab_size) * 0.02
    def forward(self, token_ids):
        T = len(token_ids)
        x = self.tok_emb[token_ids] + self.pos_emb[:T]
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = x @ self.head
        return logits
    def generate(self, prompt_ids, max_new=50, temp=0.8):
        ids = list(prompt_ids)
        for _ in range(max_new):
            logits = self.forward(np.array(ids))[-1]
            next_id = sample_with_temperature(logits, temp)
            ids.append(next_id)
        return ids
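The generate method relies on a sample_with_temperature helper. A minimal sketch of what such a helper typically does (scale the logits by temperature, softmax, then sample); the exact implementation here is an assumption, not the original:

```python
import numpy as np

def sample_with_temperature(logits, temp=0.8):
    """Divide logits by temperature, softmax, then sample one token id.
    Lower temp sharpens the distribution; higher temp flattens it."""
    z = logits / temp
    z = z - z.max()                       # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return np.random.choice(len(probs), p=probs)

# Usage: at low temperature the top logit wins almost every time
np.random.seed(0)
logits = np.array([2.0, 1.0, 0.1])
picks = [sample_with_temperature(logits, temp=0.1) for _ in range(100)]
print(max(set(picks), key=picks.count))   # 0
```

At temp → 0 this approaches greedy decoding (always the argmax); at temp → ∞ it approaches uniform random tokens.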
Build It
This code defines the complete tiny language model — it sets up all the parts (embeddings, attention weights, feed-forward layers) and wires them together so text goes in one end and predictions come out the other.
import numpy as np
class TinyLM:
    def __init__(self, vocab_size=40, d_model=32, n_heads=2, d_ff=64, ctx_len=32):
        s = 0.02
        self.embed = np.random.randn(vocab_size, d_model) * s  # token embedding
        self.pos = np.random.randn(ctx_len, d_model) * s       # position embedding
        self.Wq = np.random.randn(d_model, d_model) * s        # attention
        self.Wk = np.random.randn(d_model, d_model) * s
        self.Wv = np.random.randn(d_model, d_model) * s
        self.Wo = np.random.randn(d_model, d_model) * s
        self.W1 = np.random.randn(d_model, d_ff) * s           # FFN
        self.W2 = np.random.randn(d_ff, d_model) * s
        # LM head = embed.T (weight tying)
    def forward(self, token_ids):
        x = self.embed[token_ids] + self.pos[:len(token_ids)]
        # ... attention, FFN, residuals (see visualization)
        logits = x @ self.embed.T  # weight-tied LM head
        return logits
Under the Hood
This tiny model is a miniature version of the same design used by the largest AI systems in the world. This model implements every concept: embedding lookup (Section 15), positional encoding (Section 22), RMSNorm (Section 13), multi-head causal attention with √d_k scaling (Section 17), residual connections (Section 13), FFN with GELU (Section 18), and weight-tied LM head (Section 19). Character-level tokenization avoids needing BPE. Total params: ~10K-15K — small enough to train in your browser.
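The RMSNorm mentioned above fits in a few lines. This minimal version omits the learned per-dimension gain that full implementations multiply in at the end:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale each row to unit root-mean-square. Unlike LayerNorm,
    RMSNorm skips the mean-subtraction step (no centering)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[3.0, 4.0]])
print(rms_norm(x))   # each row now has RMS ≈ 1
```

Dropping the centering step saves a pass over the activations while normalizing their magnitude just as well, which is why many recent models prefer RMSNorm over classic LayerNorm.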
Key Takeaway
- Everything you learned — from tiny voting machines (neurons) to the attention party trick — comes together in this one working model, like assembling an engine from parts you already understand.
- At its core, a language model does four things: turn words into numbers, figure out which words matter to each other, process that information, and predict the next word.
- The exact same design works whether the model has 10,000 settings or over a trillion — bigger just means it can learn more patterns and give better answers.