
MLP From Scratch (NumPy Only) - Teaching Version

This version leaves selected sections as TODOs for class and includes optional clickable hints.

Plan

  1. Create a non-linear binary classification dataset.

  2. Implement building blocks (sigmoid, relu, dense layers, BCE loss).

  3. Write forward and backward passes.

  4. Train the network.

  5. Visualize model behaviour and discuss extensions.

Throughout the notebook, short Try it prompts are included for students.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(7)

1) Build a toy dataset

We create a two-moons style dataset (non-linear, but compact enough for a from-scratch implementation).

To keep the task realistic, we use a moderate noise level so the model should improve a lot, but not trivially hit 100% validation accuracy.

def make_moons_np(n_samples=1000, noise=0.24):
    """Return X shape (N, 2), y shape (N, 1)."""
    n0 = n_samples // 2
    n1 = n_samples - n0

    t0 = np.random.rand(n0) * np.pi
    t1 = np.random.rand(n1) * np.pi

    x0 = np.column_stack([np.cos(t0), np.sin(t0)])
    x1 = np.column_stack([1.0 - np.cos(t1), -np.sin(t1) + 0.5])

    X = np.vstack([x0, x1])
    X += noise * np.random.randn(*X.shape)

    y = np.concatenate([np.zeros(n0), np.ones(n1)])[:, None]

    idx = np.random.permutation(X.shape[0])
    return X[idx], y[idx]


def train_val_split(X, y, val_fraction=0.2):
    n = X.shape[0]
    n_val = int(n * val_fraction)
    X_val, y_val = X[:n_val], y[:n_val]
    X_train, y_train = X[n_val:], y[n_val:]
    return X_train, y_train, X_val, y_val


X, y = make_moons_np(n_samples=1200, noise=0.44)
X_train, y_train, X_val, y_val = train_val_split(X, y, val_fraction=0.2)

print("Train:", X_train.shape, y_train.shape)
print("Val:  ", X_val.shape, y_val.shape)
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train[:, 0], s=10, cmap="coolwarm", alpha=0.75)
ax.set_title("Training data")
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_aspect("equal", adjustable="box")
ax.grid(alpha=0.25)
for spine in ["left", "right", "top", "bottom"]:
    ax.spines[spine].set_visible(True)
plt.show()

2) MLP math primitives

Network layout for this practical: 2 -> 32 -> 32 -> 1.

For each layer:

Z^{[l]} = A^{[l-1]}W^{[l]} + b^{[l]}, \qquad A^{[l]} = g\left(Z^{[l]}\right)
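
To make the shapes concrete: with a batch of N samples, A^{[l-1]} is (N, n_in), W^{[l]} is (n_in, n_out), and b^{[l]} is (1, n_out), which NumPy broadcasts across the rows. A minimal sketch with illustrative sizes (N=4, n_in=2, n_out=3):

# Illustrative sizes only: N=4 samples, 2 inputs, 3 hidden units.
A_prev = np.random.randn(4, 2)   # A^{[l-1]}: (N, n_in)
W = np.random.randn(2, 3)        # W^{[l]}:   (n_in, n_out)
b = np.zeros((1, 3))             # b^{[l]}:   (1, n_out), broadcast over the rows
Z = A_prev @ W + b               # Z^{[l]}:   (N, n_out)
print(Z.shape)                   # (4, 3)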

Defined functions:

  • Sigmoid:

    \sigma(x)=\frac{1}{1+e^{-x}}
  • ReLU:

    \mathrm{ReLU}(x)=\max(0,x)
  • ReLU derivative (used in backprop):

    \frac{d\,\mathrm{ReLU}}{dx}= \begin{cases} 1, & x>0 \\ 0, & x\le 0 \end{cases}
  • Binary cross-entropy (BCE):

    \mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^N\Big[y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\Big]
  • Accuracy:

    \mathrm{Acc}=\frac{1}{N}\sum_{i=1}^N\mathbf{1}\big[(\hat y_i\ge 0.5)=y_i\big]

Why EPS?

In BCE we compute log(p) and log(1-p). If p becomes exactly 0 or 1, logs blow up to -inf. We therefore clip probabilities to [EPS, 1-EPS] with a tiny constant EPS=1e-8 for numerical stability.
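
A quick illustration (a minimal sketch; the np.errstate context only silences the expected divide-by-zero warning):

p = np.array([0.0, 0.5, 1.0])
with np.errstate(divide="ignore"):
    print(np.log(p))                          # [-inf, -0.693..., 0.]
print(np.log(np.clip(p, 1e-8, 1 - 1e-8)))     # all values finite after clipping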

Hint
EPS = 1e-8


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(0.0, x)


def d_relu(x):
    return (x > 0).astype(float)
EPS = 1e-8  # numerical safety for log in BCE


def sigmoid(x):
    # TODO: implement sigmoid
    raise NotImplementedError


def relu(x):
    # TODO: implement ReLU
    raise NotImplementedError


def d_relu(x):
    # TODO: implement derivative of ReLU
    raise NotImplementedError


def binary_cross_entropy(y_true, y_prob):
    y_prob = np.clip(y_prob, EPS, 1.0 - EPS)
    return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))


def accuracy(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    return np.mean(y_pred == y_true)

Visual check: activation function shapes

This helps students connect the formulas to the curves used in forward/backward propagation.

x = np.linspace(-6, 6, 400)

fig, ax = plt.subplots(1, 3, figsize=(14, 4))

ax[0].plot(x, sigmoid(x), color="tab:blue")
ax[0].set_title("sigmoid(x)")
ax[0].set_xlabel("x")
ax[0].set_ylabel("y")
ax[0].grid(alpha=0.25)

ax[1].plot(x, relu(x), color="tab:orange")
ax[1].set_title("relu(x)")
ax[1].set_xlabel("x")
ax[1].set_ylabel("y")
ax[1].grid(alpha=0.25)

ax[2].plot(x, d_relu(x), color="tab:green")
ax[2].set_title("d_relu(x)")
ax[2].set_xlabel("x")
ax[2].set_ylabel("y")
ax[2].set_ylim(-0.1, 1.1)
ax[2].grid(alpha=0.25)

for a in ax:
    for spine in ["left", "right", "top", "bottom"]:
        a.spines[spine].set_visible(True)

plt.tight_layout()
plt.show()

3) Initialize parameters

We use He initialization for ReLU hidden layers:

W^{[l]}_{ij} \sim \mathcal{N}\left(0,\frac{2}{n_{\mathrm{in}}}\right)

which is equivalent to scaling a standard normal by \sqrt{2/n_{\mathrm{in}}}.

Why this helps: with ReLU networks, this keeps activation/gradient magnitudes in a healthy range early in training.

Reference: K. He, X. Zhang, S. Ren, J. Sun (2015), Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.

Hint
def init_mlp(layer_sizes, seed=0):
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        fan_in = layer_sizes[l - 1]
        params[f"W{l}"] = rng.standard_normal((fan_in, layer_sizes[l])) * np.sqrt(2.0 / fan_in)
        params[f"b{l}"] = np.zeros((1, layer_sizes[l]))
    return params
def init_mlp(layer_sizes, seed=0):
    """layer_sizes example: [2, 32, 32, 1]"""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        fan_in = layer_sizes[l - 1]
        # TODO: initialize W_l with He initialization
        params[f"W{l}"] = None
        # TODO: initialize b_l as zeros
        params[f"b{l}"] = None
    return params


layer_sizes = [2, 32, 32, 1]
params = init_mlp(layer_sizes, seed=123)

for k, v in params.items():
    print(f"{k}: {v.shape}")
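
Once the TODOs above are filled in, a quick sanity check (a minimal sketch) is to compare the empirical standard deviation of each weight matrix with the He target \sqrt{2/n_{\mathrm{in}}}; the match is rough for small matrices such as W1:

# Sanity check (assumes the init_mlp TODOs are completed).
for l in range(1, len(layer_sizes)):
    fan_in = layer_sizes[l - 1]
    target = np.sqrt(2.0 / fan_in)
    print(f"W{l}: empirical std={params[f'W{l}'].std():.3f}, target={target:.3f}")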

4) Forward pass

What is cache?

cache is a dictionary storing intermediate tensors (A0, Z1, A1, ..., ZL, AL) from the forward pass. Backprop needs these values to compute gradients efficiently.

What is happening in the code?

  1. Set A0 = X (input activations).

  2. For hidden layers (1..L-1):

    • compute linear step: Z^{[l]}=A^{[l-1]}W^{[l]}+b^{[l]}

    • apply ReLU: A^{[l]}=\mathrm{ReLU}(Z^{[l]})

    • store both in cache

  3. For output layer L:

    • compute Z^{[L]}

    • apply sigmoid to get the probability A^{[L]}=\hat y

  4. Return predictions and cache.

Hint
def forward_pass(X, params):
    cache = {"A0": X}
    L = len(params) // 2

    A = X
    for l in range(1, L):
        Z = A @ params[f"W{l}"] + params[f"b{l}"]
        A = relu(Z)
        cache[f"Z{l}"] = Z
        cache[f"A{l}"] = A

    ZL = A @ params[f"W{L}"] + params[f"b{L}"]
    AL = sigmoid(ZL)
    cache[f"Z{L}"] = ZL
    cache[f"A{L}"] = AL
    return AL, cache
def forward_pass(X, params):
    """
    Returns:
      y_prob: output probabilities, shape (N, 1)
      cache: intermediates for backprop
    """
    cache = {"A0": X}
    L = len(params) // 2

    A = X
    for l in range(1, L):
        # TODO: compute the linear step for layer l
        Z = None
        # TODO: apply the hidden activation
        A = None
        cache[f"Z{l}"] = Z
        cache[f"A{l}"] = A

    # TODO: compute the output-layer linear step
    ZL = None
    # TODO: apply sigmoid to get output probabilities
    AL = None
    cache[f"Z{L}"] = ZL
    cache[f"A{L}"] = AL

    return AL, cache


y_prob, cache = forward_pass(X_train[:5], params)
print("y_prob shape:", y_prob.shape)
print(y_prob[:3].ravel())

5) Backward pass

Backprop uses the chain rule repeatedly from output to input.

1) Notation: shorthand used in the code

In code, we use compact names:

  • dZ[l] means \frac{\partial \mathcal{L}}{\partial Z^{[l]}}

  • dW[l] means \frac{\partial \mathcal{L}}{\partial W^{[l]}}

  • db[l] means \frac{\partial \mathcal{L}}{\partial b^{[l]}}

  • dA[l] means \frac{\partial \mathcal{L}}{\partial A^{[l]}}

This is just shorthand to keep equations and code readable.

2) Output layer derivation (L)

For the final layer:

Z^{[L]} = A^{[L-1]}W^{[L]} + b^{[L]}, \qquad A^{[L]} = \sigma(Z^{[L]})

With BCE + sigmoid, we get the useful simplification:

dZ^{[L]} = A^{[L]} - y
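
One way to see this simplification, per sample with a=\sigma(z) (the 1/N factor reappears below in the dW and db averages):

\frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} = \frac{a-y}{a(1-a)}, \qquad \frac{\partial a}{\partial z} = a(1-a), \qquad \frac{\partial \mathcal{L}}{\partial z} = \frac{a-y}{a(1-a)}\, a(1-a) = a - y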

Then parameter gradients are:

dW^{[L]} = \frac{1}{N}(A^{[L-1]})^T dZ^{[L]}, \qquad db^{[L]} = \frac{1}{N}\sum_{i=1}^{N} dZ^{[L]}_i

Why the transpose on A^{[L-1]}? It aligns matrix shapes so dW^{[L]} has the same shape as W^{[L]}.

3) Hidden-layer recursion (general layer l)

For layers l=L-1,\dots,1:

dA^{[l]} = dZ^{[l+1]}(W^{[l+1]})^T
dZ^{[l]} = dA^{[l]} \odot g'\left(Z^{[l]}\right)
dW^{[l]} = \frac{1}{N}(A^{[l-1]})^T dZ^{[l]}, \qquad db^{[l]} = \frac{1}{N}\sum_{i=1}^{N} dZ^{[l]}_i

where \odot is elementwise multiplication, and here g=\mathrm{ReLU}, so g'(Z^{[l]}) = d_relu(Z^{[l]}).

4) Code-to-equation mapping

  • dZ = AL - y_true \Rightarrow dZ^{[L]}

  • grads[f"dW{L}"] = (A_prev.T @ dZ) / N \Rightarrow dW^{[L]}

  • grads[f"db{L}"] = np.sum(dZ, axis=0, keepdims=True) / N \Rightarrow db^{[L]}

  • dA = dZ @ W.T \Rightarrow dA^{[l]} = dZ^{[l+1]}(W^{[l+1]})^T

  • dZ = dA * d_relu(Z) \Rightarrow dZ^{[l]} = dA^{[l]} \odot g'(Z^{[l]})

  • then the same dW, db formulas for each hidden layer.

def backward_pass(y_true, params, cache):
    grads = {}
    L = len(params) // 2
    N = y_true.shape[0]

    # Output layer gradients
    A_prev = cache[f"A{L-1}"]
    AL = cache[f"A{L}"]
    dZ = AL - y_true  # BCE + sigmoid simplification
    grads[f"dW{L}"] = (A_prev.T @ dZ) / N
    grads[f"db{L}"] = np.sum(dZ, axis=0, keepdims=True) / N

    # Propagate backwards through hidden layers
    for l in range(L - 1, 0, -1):
        dA = dZ @ params[f"W{l+1}"].T
        dZ = dA * d_relu(cache[f"Z{l}"])
        A_prev = cache[f"A{l-1}"]
        grads[f"dW{l}"] = (A_prev.T @ dZ) / N
        grads[f"db{l}"] = np.sum(dZ, axis=0, keepdims=True) / N

    return grads


grads = backward_pass(y_train[:5], params, cache)
for k, v in grads.items():
    print(f"{k}: {v.shape}")
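
To build confidence in backward_pass, a finite-difference check is useful: nudge one parameter entry, measure how the loss changes, and compare against the analytic gradient. A minimal sketch (the helper numerical_grad_check is introduced here for illustration and assumes the forward-pass TODOs are completed):

def numerical_grad_check(X, y, params, name, i, j, eps=1e-5):
    """Central finite difference of the mean BCE loss w.r.t. one entry of params[name]."""
    old = params[name][i, j]

    params[name][i, j] = old + eps
    loss_plus = binary_cross_entropy(y, forward_pass(X, params)[0])

    params[name][i, j] = old - eps
    loss_minus = binary_cross_entropy(y, forward_pass(X, params)[0])

    params[name][i, j] = old  # restore the original value
    return (loss_plus - loss_minus) / (2 * eps)


# Compare a few entries of dW1 with finite differences (illustration only).
Xb, yb = X_train[:32], y_train[:32]
_, cache_b = forward_pass(Xb, params)
grads_b = backward_pass(yb, params, cache_b)

for i, j in [(0, 0), (1, 5)]:
    num = numerical_grad_check(Xb, yb, params, "W1", i, j)
    ana = grads_b["dW1"][i, j]
    print(f"dW1[{i},{j}]: numerical={num:.6f}  analytic={ana:.6f}")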

6) Parameter update + mini-batches

Parameter update

We apply gradient descent after each mini-batch:

W^{[l]} \leftarrow W^{[l]} - \eta\, dW^{[l]}, \qquad b^{[l]} \leftarrow b^{[l]} - \eta\, db^{[l]}

where \eta is the learning rate.

Mini-batches

Instead of one update on the full dataset, we:

  1. shuffle indices,

  2. split data into chunks of size batch_size,

  3. run forward/backward/update per chunk.

This is usually faster and gives a useful stochastic training signal.

def update_params(params, grads, lr=0.05):
    L = len(params) // 2
    for l in range(1, L + 1):
        params[f"W{l}"] -= lr * grads[f"dW{l}"]
        params[f"b{l}"] -= lr * grads[f"db{l}"]


def iterate_minibatches(X, y, batch_size=64, shuffle=True):
    n = X.shape[0]
    idx = np.arange(n)
    if shuffle:
        np.random.shuffle(idx)

    for start in range(0, n, batch_size):
        end = start + batch_size
        batch_idx = idx[start:end]
        yield X[batch_idx], y[batch_idx]
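
A quick check (a minimal sketch) that one pass over iterate_minibatches visits every training sample exactly once per epoch:

# One epoch of mini-batches should cover the whole training set once.
n_seen = sum(xb.shape[0] for xb, _ in iterate_minibatches(X_train, y_train, batch_size=64))
print(n_seen, "of", X_train.shape[0], "samples visited")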

7) Training loop

def train_mlp(
    X_train,
    y_train,
    X_val,
    y_val,
    layer_sizes=(2, 32, 32, 1),
    epochs=400,
    lr=0.05,
    batch_size=64,
    seed=123,
):
    params = init_mlp(list(layer_sizes), seed=seed)
    history = {
        "train_loss": [],
        "val_loss": [],
        "train_acc": [],
        "val_acc": [],
    }

    for epoch in range(1, epochs + 1):
        for xb, yb in iterate_minibatches(X_train, y_train, batch_size=batch_size, shuffle=True):
            yb_prob, cache = forward_pass(xb, params)
            grads = backward_pass(yb, params, cache)
            update_params(params, grads, lr=lr)

        train_prob, _ = forward_pass(X_train, params)
        val_prob, _ = forward_pass(X_val, params)

        train_loss = binary_cross_entropy(y_train, train_prob)
        val_loss = binary_cross_entropy(y_val, val_prob)
        train_acc = accuracy(y_train, train_prob)
        val_acc = accuracy(y_val, val_prob)

        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["train_acc"].append(train_acc)
        history["val_acc"].append(val_acc)

        if epoch % 50 == 0 or epoch == 1:
            print(
                f"epoch {epoch:4d} | "
                f"train_loss={train_loss:.4f} val_loss={val_loss:.4f} | "
                f"train_acc={train_acc:.3f} val_acc={val_acc:.3f}"
            )

    return params, history


params, history = train_mlp(
    X_train,
    y_train,
    X_val,
    y_val,
    layer_sizes=(2, 32, 32, 1),
    epochs=400,
    lr=0.03,
    batch_size=64,
    seed=123,
)

fig, ax = plt.subplots(1, 2, figsize=(12, 4))

ax[0].plot(history["train_loss"], label="train")
ax[0].plot(history["val_loss"], label="val")
ax[0].set_title("Binary cross-entropy")
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Loss")
ax[0].grid(alpha=0.25)
ax[0].legend()

ax[1].plot(history["train_acc"], label="train")
ax[1].plot(history["val_acc"], label="val")
ax[1].set_title("Accuracy")
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Accuracy")
ax[1].grid(alpha=0.25)
ax[1].legend()

for a in ax:
    for spine in ["left", "right", "top", "bottom"]:
        a.spines[spine].set_visible(True)

plt.tight_layout()
plt.show()

print("Final train accuracy:", history["train_acc"][-1])
print("Final val accuracy:  ", history["val_acc"][-1])
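
Beyond a single accuracy number, it can help to see how the errors split between the two classes. A minimal sketch of confusion counts on the validation set, using the same 0.5 threshold as the accuracy metric:

val_prob, _ = forward_pass(X_val, params)
val_pred = (val_prob >= 0.5).astype(int)

tp = int(np.sum((val_pred == 1) & (y_val == 1)))  # true positives
tn = int(np.sum((val_pred == 0) & (y_val == 0)))  # true negatives
fp = int(np.sum((val_pred == 1) & (y_val == 0)))  # false positives
fn = int(np.sum((val_pred == 0) & (y_val == 1)))  # false negatives
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")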

8) Visualize decision boundary

def plot_decision_boundary(X, y, params, title="Decision boundary"):
    x_min, x_max = X[:, 0].min() - 0.2, X[:, 0].max() + 0.2
    y_min, y_max = X[:, 1].min() - 0.2, X[:, 1].max() + 0.2
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, 300),
        np.linspace(y_min, y_max, 300),
    )

    grid = np.column_stack([xx.ravel(), yy.ravel()])
    probs, _ = forward_pass(grid, params)

    # Force plain float arrays for compatibility with older matplotlib/numpy stacks.
    Z = np.asarray(probs, dtype=np.float64).reshape(xx.shape)

    fig, ax = plt.subplots(figsize=(6, 6))
    levels = np.linspace(0.0, 1.0, 31)
    ax.contourf(xx, yy, Z, levels=levels, cmap="coolwarm", alpha=0.5)
    ax.contour(xx, yy, Z, levels=[0.5], colors="k", linewidths=1.5)
    ax.scatter(X[:, 0], X[:, 1], c=y[:, 0], s=10, cmap="coolwarm", edgecolor="k", linewidth=0.1)
    ax.set_title(title)
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    ax.set_aspect("equal", adjustable="box")
    ax.grid(alpha=0.2)
    for spine in ["left", "right", "top", "bottom"]:
        ax.spines[spine].set_visible(True)
    plt.show()


plot_decision_boundary(X_val, y_val, params, title="MLP decision boundary on validation set")

9) Try It (student tasks)

Use the trained baseline from this notebook as your reference setup unless the task says otherwise:

  • layer_sizes=(2, 32, 32, 1)

  • epochs=400

  • lr=0.03

  • batch_size=64

  • seed=123

For each task, keep all other settings fixed so you can make a fair comparison.

  1. Change width/depth. Train these three alternatives: (2, 8, 1), (2, 64, 64, 1), (2, 32, 32, 32, 1). Compare the final train/validation loss and accuracy, and inspect whether the decision boundary becomes too simple or more flexible.

  2. Replace ReLU with tanh in the hidden layers. Keep the same baseline architecture (2, 32, 32, 1). Compare the loss curves, final validation accuracy, and whether optimization looks faster or slower in the first 50 to 100 epochs.

  3. Add L2 regularization to the dW updates. L2 regularization adds a penalty for large weights:

    \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{BCE}} + \frac{\lambda}{2N}\sum_{l=1}^{L}\|W^{[l]}\|_2^2

    where

    \|W^{[l]}\|_2^2 = \sum_{i,j}\left(W^{[l]}_{ij}\right)^2

    This discourages very large weights and usually makes the model smoother. In backpropagation, this adds an extra term to each weight gradient:

    \frac{\partial \mathcal{L}_{\text{total}}}{\partial W^{[l]}} = \frac{\partial \mathcal{L}_{\text{BCE}}}{\partial W^{[l]}} + \frac{\lambda}{N}W^{[l]}

    In words: after computing the usual gradient for each weight matrix, add a fraction of the weights themselves to that gradient. Bias gradients are usually left unchanged. For this task, use a slightly larger model so the effect is easier to observe: layer_sizes=(2, 64, 64, 64, 1). Compare two runs: lambda = 0.0 and lambda = 0.2. To track the overall weight size during training, add this helper function:

    def weight_l2_norm(params):
        L = len(params) // 2
        total = sum(np.sum(params[f"W{l}"] ** 2) for l in range(1, L + 1))
        return np.sqrt(total)

    Then add a new history entry in train_mlp:

    history = {
        "train_loss": [],
        "val_loss": [],
        "train_acc": [],
        "val_acc": [],
        "weight_norm": [],
    }

    After computing the metrics for an epoch, append the weight norm in the same part of the training loop where the other history values are stored:

    history["weight_norm"].append(weight_l2_norm(params))

    Compare training loss, validation loss, training accuracy, validation accuracy, and the weight-norm curve. Check whether lambda = 0.2 keeps the weights smaller and changes the train/validation gap. Also compare the decision boundaries: how does lambda affect them?

  4. Implement early stopping on validation loss. Use the baseline settings first, then if early stopping never triggers, increase model capacity or epoch count. A good stress test is to try a larger model such as (2, 64, 64, 64, 1) with epochs=1000. Record the best validation epoch, the stopping epoch, and whether restoring the best parameters improves the final validation result.

  5. Stress test the noise level in make_moons_np. Try a few values such as 0.10, 0.24, and 0.44. Compare both the curves and the decision boundary, and describe how increasing noise changes smoothness, ambiguity, and overfitting risk.

Optional challenge: implement a momentum optimizer and compare it with plain SGD.
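
If you attempt the optional challenge, here is a minimal sketch of one possible classical-momentum update (the velocity buffers and the beta=0.9 default are illustrative choices, not part of the notebook above). It is meant as a drop-in replacement for update_params inside the training loop:

def init_velocity(params):
    """One zero-initialized velocity buffer per parameter."""
    return {k: np.zeros_like(v) for k, v in params.items()}


def update_params_momentum(params, grads, velocity, lr=0.05, beta=0.9):
    """Classical momentum: v <- beta * v + grad, then param <- param - lr * v."""
    # beta=0.9 is an illustrative default, not a tuned value.
    L = len(params) // 2
    for l in range(1, L + 1):
        for name in (f"W{l}", f"b{l}"):
            velocity[name] = beta * velocity[name] + grads[f"d{name}"]
            params[name] -= lr * velocity[name]

To use it, create velocity = init_velocity(params) once before the epoch loop, call update_params_momentum(params, grads, velocity, lr=lr) in place of update_params, and compare the resulting loss curves with plain SGD at the same learning rate.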