MLP From Scratch (NumPy Only) - Teaching Version
This version leaves selected sections as TODOs for class and includes optional clickable hints.
Plan¶
Create a non-linear binary classification dataset.
Implement building blocks (sigmoid, relu, dense layers, BCE loss).
Write forward and backward passes.
Train the network.
Visualize model behaviour and discuss extensions.
Throughout the notebook, short Try it prompts are included for students.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(7)
1) Build a toy dataset¶
We create a two-moons style dataset (non-linear, but compact enough for a from-scratch implementation).
To keep the task realistic, we use a moderate noise level so the model should improve a lot, but not trivially hit 100% validation accuracy.
def make_moons_np(n_samples=1000, noise=0.24):
"""Return X shape (N, 2), y shape (N, 1)."""
n0 = n_samples // 2
n1 = n_samples - n0
t0 = np.random.rand(n0) * np.pi
t1 = np.random.rand(n1) * np.pi
x0 = np.column_stack([np.cos(t0), np.sin(t0)])
x1 = np.column_stack([1.0 - np.cos(t1), -np.sin(t1) + 0.5])
X = np.vstack([x0, x1])
X += noise * np.random.randn(*X.shape)
y = np.concatenate([np.zeros(n0), np.ones(n1)])[:, None]
idx = np.random.permutation(X.shape[0])
return X[idx], y[idx]
def train_val_split(X, y, val_fraction=0.2):
n = X.shape[0]
n_val = int(n * val_fraction)
X_val, y_val = X[:n_val], y[:n_val]
X_train, y_train = X[n_val:], y[n_val:]
return X_train, y_train, X_val, y_val
X, y = make_moons_np(n_samples=1200, noise=0.44)
X_train, y_train, X_val, y_val = train_val_split(X, y, val_fraction=0.2)
print("Train:", X_train.shape, y_train.shape)
print("Val: ", X_val.shape, y_val.shape)
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train[:, 0], s=10, cmap="coolwarm", alpha=0.75)
ax.set_title("Training data")
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_aspect("equal", adjustable="box")
ax.grid(alpha=0.25)
for spine in ["left", "right", "top", "bottom"]:
ax.spines[spine].set_visible(True)
plt.show()
2) MLP math primitives¶
Network layout for this practical: 2 -> 32 -> 32 -> 1.
For each layer $l$: a linear step $Z_l = A_{l-1} W_l + b_l$ followed by an activation $A_l = g(Z_l)$.
Defined functions:
Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
ReLU: $\mathrm{relu}(x) = \max(0, x)$
ReLU derivative (used in backprop): $\mathrm{relu}'(x) = \mathbf{1}[x > 0]$
Binary cross-entropy (BCE): $\mathcal{L} = -\dfrac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$
Accuracy: $\dfrac{1}{N}\sum_{i=1}^{N}\mathbf{1}\big[\mathbf{1}[p_i \ge 0.5] = y_i\big]$
Why EPS?¶
In BCE we compute log(p) and log(1-p). If p becomes exactly 0 or 1, logs blow up to -inf.
We therefore clip probabilities to [EPS, 1-EPS] with a tiny constant EPS=1e-8 for numerical stability.
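A quick numeric illustration of why the clipping matters (a standalone check with made-up values, not part of the notebook's pipeline):

```python
import numpy as np

EPS = 1e-8

y_true = np.array([[1.0]])
y_prob = np.array([[0.0]])  # a confident, wrong prediction

# Without clipping: log(0) = -inf, so the loss becomes inf.
with np.errstate(divide="ignore"):
    raw = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(raw)  # inf

# With clipping the loss is large but finite: -log(1e-8) ~ 18.42.
p = np.clip(y_prob, EPS, 1.0 - EPS)
clipped = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(clipped)
```

The clipped value still heavily penalizes the confident wrong answer, but gradients stay finite so training can continue.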
Hint
EPS = 1e-8
def sigmoid(x):
return 1.0 / (1.0 + np.exp(-x))
def relu(x):
return np.maximum(0.0, x)
def d_relu(x):
return (x > 0).astype(float)
EPS = 1e-8  # numerical safety for log in BCE
def sigmoid(x):
# TODO: implement sigmoid
raise NotImplementedError
def relu(x):
# TODO: implement ReLU
raise NotImplementedError
def d_relu(x):
# TODO: implement derivative of ReLU
raise NotImplementedError
def binary_cross_entropy(y_true, y_prob):
y_prob = np.clip(y_prob, EPS, 1.0 - EPS)
return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
def accuracy(y_true, y_prob, threshold=0.5):
y_pred = (y_prob >= threshold).astype(int)
return np.mean(y_pred == y_true)
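A small sanity check for the loss and accuracy, using made-up predictions (the expected values follow from the formulas above):

```python
import numpy as np

EPS = 1e-8

def binary_cross_entropy(y_true, y_prob):
    y_prob = np.clip(y_prob, EPS, 1.0 - EPS)
    return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

def accuracy(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    return np.mean(y_pred == y_true)

y_true = np.array([[1], [0], [1], [0]])

# Near-perfect predictions: loss close to 0, accuracy 1.0.
good = np.array([[0.99], [0.01], [0.98], [0.02]])
print(binary_cross_entropy(y_true, good))  # ~0.015
print(accuracy(y_true, good))              # 1.0

# Maximally uncertain predictions: loss is exactly -log(0.5) ~ 0.693.
unsure = np.full((4, 1), 0.5)
print(binary_cross_entropy(y_true, unsure))
```

A freshly initialized binary classifier should start near the 0.693 value; a much higher initial loss usually signals a bug.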
Visual check: activation function shapes¶
This helps students connect the formulas to the curves used in forward/backward propagation.
x = np.linspace(-6, 6, 400)
fig, ax = plt.subplots(1, 3, figsize=(14, 4))
ax[0].plot(x, sigmoid(x), color="tab:blue")
ax[0].set_title("sigmoid(x)")
ax[0].set_xlabel("x")
ax[0].set_ylabel("y")
ax[0].grid(alpha=0.25)
ax[1].plot(x, relu(x), color="tab:orange")
ax[1].set_title("relu(x)")
ax[1].set_xlabel("x")
ax[1].set_ylabel("y")
ax[1].grid(alpha=0.25)
ax[2].plot(x, d_relu(x), color="tab:green")
ax[2].set_title("d_relu(x)")
ax[2].set_xlabel("x")
ax[2].set_ylabel("y")
ax[2].set_ylim(-0.1, 1.1)
ax[2].grid(alpha=0.25)
for a in ax:
for spine in ["left", "right", "top", "bottom"]:
a.spines[spine].set_visible(True)
plt.tight_layout()
plt.show()
3) Initialize parameters¶
We use He initialization for ReLU hidden layers:
$W_l \sim \mathcal{N}\!\left(0,\; \tfrac{2}{\text{fan\_in}}\right)$,
which is equivalent to scaling a standard normal by $\sqrt{2 / \text{fan\_in}}$.
Why this helps: with ReLU networks, this keeps activation/gradient magnitudes in a healthy range early in training.
Reference: K. He, X. Zhang, S. Ren, J. Sun (2015), Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.
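A quick empirical illustration of that claim (a standalone sketch with illustrative names, separate from the notebook's `init_mlp`): push data through a stack of ReLU layers and compare the final activation scale under He scaling versus unscaled standard-normal weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))

def forward_std(scale_fn, depth=10, width=32):
    """Push X through `depth` random ReLU layers; return the final activation std."""
    A = X
    fan_in = X.shape[1]
    for _ in range(depth):
        W = rng.standard_normal((fan_in, width)) * scale_fn(fan_in)
        A = np.maximum(0.0, A @ W)  # ReLU
        fan_in = width
    return A.std()

# He scaling keeps activations near their input scale; unscaled weights blow up.
print("He  (sqrt(2/fan_in)):", forward_std(lambda f: np.sqrt(2.0 / f)))
print("naive (scale 1.0)   :", forward_std(lambda f: 1.0))
```

With unscaled weights, each ReLU layer multiplies the activation variance by roughly `fan_in / 2`, so after ten layers the activations are many orders of magnitude too large; He scaling makes that factor approximately 1.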
Hint
def init_mlp(layer_sizes, seed=0):
rng = np.random.default_rng(seed)
params = {}
for l in range(1, len(layer_sizes)):
fan_in = layer_sizes[l - 1]
params[f"W{l}"] = rng.standard_normal((fan_in, layer_sizes[l])) * np.sqrt(2.0 / fan_in)
params[f"b{l}"] = np.zeros((1, layer_sizes[l]))
return params
def init_mlp(layer_sizes, seed=0):
"""layer_sizes example: [2, 32, 32, 1]"""
rng = np.random.default_rng(seed)
params = {}
for l in range(1, len(layer_sizes)):
fan_in = layer_sizes[l - 1]
# TODO: initialize W_l with He initialization
params[f"W{l}"] = None
# TODO: initialize b_l as zeros
params[f"b{l}"] = None
return params
layer_sizes = [2, 32, 32, 1]
params = init_mlp(layer_sizes, seed=123)
for k, v in params.items():
print(f"{k}: {v.shape}")
4) Forward pass¶
What is cache?¶
cache is a dictionary storing intermediate tensors (A0, Z1, A1, ..., ZL, AL) from the forward pass.
Backprop needs these values to compute gradients efficiently.
What is happening in the code?¶
Set A0 = X (input activations).
For hidden layers (1..L-1):
compute the linear step $Z_l = A_{l-1} W_l + b_l$,
apply ReLU: $A_l = \mathrm{relu}(Z_l)$,
store both in cache.
For the output layer L:
compute $Z_L = A_{L-1} W_L + b_L$,
apply sigmoid to get the probability $A_L = \sigma(Z_L)$.
Return predictions and cache.
Hint
def forward_pass(X, params):
cache = {"A0": X}
L = len(params) // 2
A = X
for l in range(1, L):
Z = A @ params[f"W{l}"] + params[f"b{l}"]
A = relu(Z)
cache[f"Z{l}"] = Z
cache[f"A{l}"] = A
ZL = A @ params[f"W{L}"] + params[f"b{L}"]
AL = sigmoid(ZL)
cache[f"Z{L}"] = ZL
cache[f"A{L}"] = AL
return AL, cache
def forward_pass(X, params):
"""
Returns:
y_prob: output probabilities, shape (N, 1)
cache: intermediates for backprop
"""
cache = {"A0": X}
L = len(params) // 2
A = X
for l in range(1, L):
# TODO: compute the linear step for layer l
Z = None
# TODO: apply the hidden activation
A = None
cache[f"Z{l}"] = Z
cache[f"A{l}"] = A
# TODO: compute the output-layer linear step
ZL = None
# TODO: apply sigmoid to get output probabilities
AL = None
cache[f"Z{L}"] = ZL
cache[f"A{L}"] = AL
return AL, cache
y_prob, cache = forward_pass(X_train[:5], params)
print("y_prob shape:", y_prob.shape)
print(y_prob[:3].ravel())
5) Backward pass¶
Backprop uses the chain rule repeatedly from output to input.
1) Notation: common syntax compression¶
In code, we use compact names:
dZ[l] means $\partial \mathcal{L} / \partial Z_l$
dW[l] means $\partial \mathcal{L} / \partial W_l$
db[l] means $\partial \mathcal{L} / \partial b_l$
dA[l] means $\partial \mathcal{L} / \partial A_l$
This is just shorthand to keep equations and code readable.
2) Output layer derivation (L)¶
For the final layer:
$\dfrac{\partial \mathcal{L}}{\partial Z_L} = \dfrac{\partial \mathcal{L}}{\partial A_L} \cdot \dfrac{\partial A_L}{\partial Z_L}$
With BCE + sigmoid, we get the useful simplification:
$dZ_L = A_L - y$
Then parameter gradients are:
$dW_L = \dfrac{1}{N} A_{L-1}^\top dZ_L, \qquad db_L = \dfrac{1}{N}\sum_{i=1}^{N} (dZ_L)_i$
Why the transpose on $A_{L-1}$? It aligns matrix shapes so $dW_L$ has the same shape as $W_L$.
3) Hidden-layer recursion (general layer l)¶
For layers $l = L-1, \dots, 1$:
$dA_l = dZ_{l+1} W_{l+1}^\top, \qquad dZ_l = dA_l \odot g'(Z_l)$
where $\odot$ is elementwise multiplication, and here $g = \mathrm{relu}$ so $g'(Z_l) = \mathbf{1}[Z_l > 0]$.
4) Code-to-equation mapping¶
dZ = AL - y_true
grads[f"dW{L}"] = (A_prev.T @ dZ) / N
grads[f"db{L}"] = np.sum(dZ, axis=0, keepdims=True) / N
dA = dZ @ W.T
dZ = dA * d_relu(Z)
then the same dW, db formulas for each hidden layer.
def backward_pass(y_true, params, cache):
grads = {}
L = len(params) // 2
N = y_true.shape[0]
# Output layer gradients
A_prev = cache[f"A{L-1}"]
AL = cache[f"A{L}"]
dZ = AL - y_true # BCE + sigmoid simplification
grads[f"dW{L}"] = (A_prev.T @ dZ) / N
grads[f"db{L}"] = np.sum(dZ, axis=0, keepdims=True) / N
# Propagate backwards through hidden layers
for l in range(L - 1, 0, -1):
dA = dZ @ params[f"W{l+1}"].T
dZ = dA * d_relu(cache[f"Z{l}"])
A_prev = cache[f"A{l-1}"]
grads[f"dW{l}"] = (A_prev.T @ dZ) / N
grads[f"db{l}"] = np.sum(dZ, axis=0, keepdims=True) / N
return grads
grads = backward_pass(y_train[:5], params, cache)
for k, v in grads.items():
print(f"{k}: {v.shape}")
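One way to gain confidence in a hand-written backward pass is a finite-difference check. The sketch below verifies the `dZ = AL - y_true` simplification on a single sigmoid + BCE layer (a minimal standalone example with made-up data, not the notebook's full network):

```python
import numpy as np

EPS = 1e-8
rng = np.random.default_rng(0)

X = rng.standard_normal((8, 2))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
W = rng.standard_normal((2, 1)) * 0.5
b = np.zeros((1, 1))

def loss(W, b):
    p = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    p = np.clip(p, EPS, 1.0 - EPS)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient via the BCE + sigmoid shortcut: dZ = p - y.
p = 1.0 / (1.0 + np.exp(-(X @ W + b)))
dZ = p - y
dW = X.T @ dZ / len(X)

# Numerical gradient by central differences.
h = 1e-5
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        dW_num[i, j] = (loss(Wp, b) - loss(Wm, b)) / (2 * h)

print(np.max(np.abs(dW - dW_num)))  # should be tiny, around 1e-9 or smaller
```

The same idea extends to every `dW{l}` and `db{l}` in `backward_pass`: perturb one entry, re-run the forward pass, and compare the loss difference to the analytic gradient.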
6) Parameter update + mini-batches¶
Parameter update¶
We apply gradient descent after each mini-batch:
$W_l \leftarrow W_l - \eta\, dW_l, \qquad b_l \leftarrow b_l - \eta\, db_l$
where $\eta$ is the learning rate.
Mini-batches¶
Instead of one update on the full dataset, we:
shuffle indices,
split the data into chunks of size batch_size,
run forward/backward/update per chunk.
This is usually faster and gives a useful stochastic training signal.
def update_params(params, grads, lr=0.05):
L = len(params) // 2
for l in range(1, L + 1):
params[f"W{l}"] -= lr * grads[f"dW{l}"]
params[f"b{l}"] -= lr * grads[f"db{l}"]
def iterate_minibatches(X, y, batch_size=64, shuffle=True):
n = X.shape[0]
idx = np.arange(n)
if shuffle:
np.random.shuffle(idx)
for start in range(0, n, batch_size):
end = start + batch_size
batch_idx = idx[start:end]
yield X[batch_idx], y[batch_idx]
7) Training loop¶
def train_mlp(
X_train,
y_train,
X_val,
y_val,
layer_sizes=(2, 32, 32, 1),
epochs=400,
lr=0.05,
batch_size=64,
seed=123,
):
params = init_mlp(list(layer_sizes), seed=seed)
history = {
"train_loss": [],
"val_loss": [],
"train_acc": [],
"val_acc": [],
}
for epoch in range(1, epochs + 1):
for xb, yb in iterate_minibatches(X_train, y_train, batch_size=batch_size, shuffle=True):
yb_prob, cache = forward_pass(xb, params)
grads = backward_pass(yb, params, cache)
update_params(params, grads, lr=lr)
train_prob, _ = forward_pass(X_train, params)
val_prob, _ = forward_pass(X_val, params)
train_loss = binary_cross_entropy(y_train, train_prob)
val_loss = binary_cross_entropy(y_val, val_prob)
train_acc = accuracy(y_train, train_prob)
val_acc = accuracy(y_val, val_prob)
history["train_loss"].append(train_loss)
history["val_loss"].append(val_loss)
history["train_acc"].append(train_acc)
history["val_acc"].append(val_acc)
if epoch % 50 == 0 or epoch == 1:
print(
f"epoch {epoch:4d} | "
f"train_loss={train_loss:.4f} val_loss={val_loss:.4f} | "
f"train_acc={train_acc:.3f} val_acc={val_acc:.3f}"
)
return params, history
params, history = train_mlp(
X_train,
y_train,
X_val,
y_val,
layer_sizes=(2, 32, 32, 1),
epochs=400,
lr=0.03,
batch_size=64,
seed=123,
)
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].plot(history["train_loss"], label="train")
ax[0].plot(history["val_loss"], label="val")
ax[0].set_title("Binary cross-entropy")
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Loss")
ax[0].grid(alpha=0.25)
ax[0].legend()
ax[1].plot(history["train_acc"], label="train")
ax[1].plot(history["val_acc"], label="val")
ax[1].set_title("Accuracy")
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Accuracy")
ax[1].grid(alpha=0.25)
ax[1].legend()
for a in ax:
for spine in ["left", "right", "top", "bottom"]:
a.spines[spine].set_visible(True)
plt.tight_layout()
plt.show()
print("Final train accuracy:", history["train_acc"][-1])
print("Final val accuracy: ", history["val_acc"][-1])
8) Visualize decision boundary¶
def plot_decision_boundary(X, y, params, title="Decision boundary"):
x_min, x_max = X[:, 0].min() - 0.2, X[:, 0].max() + 0.2
y_min, y_max = X[:, 1].min() - 0.2, X[:, 1].max() + 0.2
xx, yy = np.meshgrid(
np.linspace(x_min, x_max, 300),
np.linspace(y_min, y_max, 300),
)
grid = np.column_stack([xx.ravel(), yy.ravel()])
probs, _ = forward_pass(grid, params)
# Force plain float arrays for compatibility with older matplotlib/numpy stacks.
Z = np.asarray(probs, dtype=np.float64).reshape(xx.shape)
fig, ax = plt.subplots(figsize=(6, 6))
levels = np.linspace(0.0, 1.0, 31)
ax.contourf(xx, yy, Z, levels=levels, cmap="coolwarm", alpha=0.5)
ax.contour(xx, yy, Z, levels=[0.5], colors="k", linewidths=1.5)
ax.scatter(X[:, 0], X[:, 1], c=y[:, 0], s=10, cmap="coolwarm", edgecolor="k", linewidth=0.1)
ax.set_title(title)
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_aspect("equal", adjustable="box")
ax.grid(alpha=0.2)
for spine in ["left", "right", "top", "bottom"]:
ax.spines[spine].set_visible(True)
plt.show()
plot_decision_boundary(X_val, y_val, params, title="MLP decision boundary on validation set")
9) Try It (student tasks)¶
Use the trained baseline from this notebook as your reference setup unless the task says otherwise:
layer_sizes=(2, 32, 32, 1), epochs=400, lr=0.03, batch_size=64, seed=123
For each task, keep all other settings fixed so you can make a fair comparison.
Change width/depth. Train these three alternatives: (2, 8, 1), (2, 64, 64, 1), (2, 32, 32, 32, 1). Compare the final train/validation loss and accuracy, and inspect whether the decision boundary becomes too simple or more flexible.
Replace ReLU with tanh in the hidden layers. Keep the same baseline architecture (2, 32, 32, 1). Compare the loss curves, final validation accuracy, and whether optimization looks faster or slower in the first 50 to 100 epochs.
Add L2 regularization to the dW updates. L2 regularization adds a penalty for large weights:
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{BCE}} + \dfrac{\lambda}{2N} \sum_{l} \lVert W_l \rVert_2^2$
where $\lambda \ge 0$ controls the penalty strength. This discourages very large weights and usually makes the model smoother. In backpropagation, this adds an extra term to each weight gradient:
$dW_l \leftarrow dW_l + \dfrac{\lambda}{N} W_l$
In words: after computing the usual gradient for each weight matrix, add a fraction of the weights themselves to that gradient. Bias gradients are usually left unchanged. For this task, use a slightly larger model so the effect is easier to observe: layer_sizes=(2, 64, 64, 64, 1). Compare two runs: lambda = 0.0 and lambda = 0.2.
To track the overall weight size during training, add this helper function:
def weight_l2_norm(params):
    L = len(params) // 2
    total = sum(np.sum(params[f"W{l}"] ** 2) for l in range(1, L + 1))
    return np.sqrt(total)
Then add a new history entry in train_mlp:
history = {
    "train_loss": [],
    "val_loss": [],
    "train_acc": [],
    "val_acc": [],
    "weight_norm": [],
}
After computing the metrics for an epoch, append the weight norm in the same part of the training loop where the other history values are stored:
history["weight_norm"].append(weight_l2_norm(params))
Compare training loss, validation loss, training accuracy, validation accuracy, and the weight norm curve. Check whether lambda = 0.2 keeps the weights smaller and changes the train/validation gap. Also compare the decision boundaries: how does lambda affect them?
Implement early stopping on validation loss. Use the baseline settings first; then, if early stopping never triggers, increase the model capacity or the epoch count. A good stress test is a larger model such as (2, 64, 64, 64, 1) with epochs=1000. Record the best validation epoch, the stopping epoch, and whether restoring the best parameters improves the final validation result.
Stress test the noise level in make_moons_np. Try a few values such as 0.10, 0.24, and 0.44. Compare both the curves and the decision boundary, and describe how increasing noise changes smoothness, ambiguity, and overfitting risk.
Optional challenge: implement a momentum optimizer and compare it with plain SGD.
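For the optional challenge, one possible shape for the momentum update (a hedged sketch: the velocity buffer, the beta hyperparameter, and the function names are illustrative choices, not prescribed by the notebook):

```python
import numpy as np

def init_velocity(params):
    """One zero buffer per parameter, with the same shapes as params."""
    return {k: np.zeros_like(v) for k, v in params.items()}

def update_params_momentum(params, grads, velocity, lr=0.05, beta=0.9):
    """Classical momentum: v <- beta * v + grad; theta <- theta - lr * v."""
    L = len(params) // 2
    for l in range(1, L + 1):
        for name in (f"W{l}", f"b{l}"):
            velocity[name] = beta * velocity[name] + grads[f"d{name}"]
            params[name] -= lr * velocity[name]

# Tiny usage example with made-up shapes and gradients:
params = {"W1": np.ones((2, 3)), "b1": np.zeros((1, 3))}
grads = {"dW1": np.full((2, 3), 0.1), "db1": np.full((1, 3), 0.1)}
vel = init_velocity(params)
update_params_momentum(params, grads, vel, lr=0.1, beta=0.9)
print(params["W1"][0, 0])  # 1 - 0.1 * 0.1 = 0.99
```

In the training loop, `init_velocity` would be called once after `init_mlp`, and `update_params_momentum` would replace `update_params` per mini-batch; the velocity carries information across batches, which is what distinguishes it from plain SGD.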