Deriving and Implementing a Neural Network From Scratch
Introduction
This blog post shows how to derive the equations to train a simple 2-layer neural network and how to implement them in Python using the NumPy library.
First, we’re going to tackle the math by deriving the equations we need and explicitly writing out the necessary matrices in the first and second sections of this blog post. This is done to reduce the likelihood of making errors. Keeping track of matrices is hard enough, and it can get even more confusing once we start taking derivatives with respect to matrices.
Once we’ve gotten the equations and notation straightened out, we’ll proceed to coding up the equations in the Python implementation section. That’ll be the easier part!
What’s the motivation for doing things from scratch? Even if you plan on using frameworks like PyTorch and TensorFlow, performing this exercise to understand neural networks and backpropagation is important because these frameworks are a leaky abstraction. We can be more effective at building and debugging neural networks if we understand what’s going on underneath (see Andrej Karpathy’s “Yes you should understand backprop”).
Defining the Neural Network
For this exercise, we’re going to use a simple 2-layer neural network that looks like this:
$$ \text{training data (input)} \rightarrow \text{layer 1 (hidden layer)} \rightarrow \text{layer 2 (output layer)} $$
Both the first and second layers will be linear functions with no activation. Note that an activation function is typically used (e.g., ReLU) between layers to introduce nonlinearity; in this exercise, we will be omitting it for simplicity.
Next, we’ll define the relevant matrices in the neural network for a specific case.
Let’s define the training data $\textbf{X}$ as a 2x2 matrix (2 training examples x 2 features) where the subscripts represent the position in the matrix:
$$ \textbf{X} = \left(\begin{matrix} X_{11} & X_{12} \\\ X_{21} & X_{22} \end{matrix}\right) $$
We will define $N$ as the number of training examples; in this case, $N=2$.
Layer 1 Equations
Let’s define the parameters (weights and biases) of Layer 1 where the superscript bracket represents the layer number:
$$ \textbf{W}^{[1]} = \left(\begin{matrix} W_{11}^{[1]} & W_{12}^{[1]} & W_{13}^{[1]} \\\ W_{21}^{[1]} & W_{22}^{[1]} & W_{23}^{[1]} \end{matrix}\right) $$
$$ \vec{b}^{[1]} = \left[\begin{matrix} b_1^{[1]} & b_2^{[1]} & b_3^{[1]} \end{matrix}\right] $$
$\textbf{W}^{[1]}$ is a 2x3 matrix of weights and $\vec{b}^{[1]}$ is a 3-element vector holding the bias values. During the computation of Layer 1 when $\textbf{X}$ (2x2 matrix) is matrix-multiplied by $\textbf{W}^{[1]}$ (2x3 matrix), the result is a 2x3 matrix. This means that Layer 1 has 3 hidden units or “neurons.” Then, that resulting 2x3 matrix will be summed element-wise with $\vec{b}^{[1]}$, which will first be broadcast (repeated) vertically from a 1x3 to a 2x3 matrix across the training examples.
Broadcasting is an important feature of the NumPy library in Python. If you’re not familiar with it, I recommend checking out NumPy’s broadcasting documentation.
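If broadcasting is new to you, here's a tiny standalone sketch (with made-up numbers, unrelated to our network) of how a 1x3 bias row gets added to every row of a 2x3 matrix:

```python
import numpy as np

Z = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])    # stand-in for X @ W1, shape (2, 3)
b = np.array([[0.1, 0.2, 0.3]])    # bias row vector, shape (1, 3)

# NumPy broadcasts b vertically so it is added to every row of Z
print(Z + b)
# [[1.1 2.2 3.3]
#  [4.1 5.2 6.3]]
```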
Let’s define the output of Layer 1 (2x3 matrix):
$$ \textbf{Y}^{[1]} = \textbf{X} \textbf{W}^{[1]} + \vec{b}^{[1]} $$
Since $\textbf{Y}^{[1]}$ is a matrix, let’s explicitly write out its matrix elements:
$$ \textbf{Y}^{[1]} = \left(\begin{matrix} X_{11} W_{11}^{[1]} + X_{12} W_{21}^{[1]} + b_1^{[1]} & X_{11} W_{12}^{[1]} + X_{12} W_{22}^{[1]} + b_2^{[1]} & X_{11} W_{13}^{[1]} + X_{12} W_{23}^{[1]} + b_3^{[1]} \\\ X_{21} W_{11}^{[1]} + X_{22} W_{21}^{[1]} + b_1^{[1]} & X_{21} W_{12}^{[1]} + X_{22} W_{22}^{[1]} + b_2^{[1]} & X_{21} W_{13}^{[1]} + X_{22} W_{23}^{[1]} + b_3^{[1]} \end{matrix}\right) $$
Layer 2 Equations
Let’s define the parameters for Layer 2, where again the superscript bracket represents the layer number, and the subscripts represent the position of each element in the matrix or vector:
$$ \textbf{W}^{[2]} = \left(\begin{matrix} W_{11}^{[2]} \\\ W_{21}^{[2]} \\\ W_{31}^{[2]} \end{matrix}\right) $$
$$ \vec{b}^{[2]} = \left[\begin{matrix} b_1^{[2]} \end{matrix}\right] $$
We know that $\textbf{Y}^{[1]}$ is a 2x3 matrix, so when matrix-multiplied by $\textbf{W}^{[2]}$ (3x1 matrix), the result is a 2x1 matrix (1 output value per training example, with a total of 2 training examples). The resulting 2x1 matrix will be summed element-wise with $\vec{b}^{[2]}$, which will first be broadcast in order to match dimensions.
Let’s define the output of Layer 2 (2x1 matrix), which is also the output of the entire neural network:
$$ \begin{aligned} \hat{\textbf{Y}} &= \textbf{Y}^{[2]} \\\ &= \textbf{Y}^{[1]} \textbf{W}^{[2]} + \vec{b}^{[2]} \\\ &= (\textbf{X} \textbf{W}^{[1]} + \vec{b}^{[1]}) \textbf{W}^{[2]} + \vec{b}^{[2]} \end{aligned} $$
The above equation shows how we can take input data $\textbf{X}$, and using parameters $\textbf{W}$ and $\vec{b}$ for each layer, we can calculate the model’s output $\hat{\textbf{Y}}$. This is called a forward pass of the neural network.
Since $\hat{\textbf{Y}}$ is a matrix (2x1), let’s explicitly write out its matrix elements:
$$ \hat{\textbf{Y}} = \left(\begin{smallmatrix} (X_{11} W_{11}^{[1]} + X_{12} W_{21}^{[1]} + b_1^{[1]})W_{11}^{[2]} + (X_{11} W_{12}^{[1]} + X_{12} W_{22}^{[1]} + b_2^{[1]})W_{21}^{[2]} + (X_{11} W_{13}^{[1]} + X_{12} W_{23}^{[1]} + b_3^{[1]})W_{31}^{[2]} + b_1^{[2]} \\\ (X_{21} W_{11}^{[1]} + X_{22} W_{21}^{[1]} + b_1^{[1]})W_{11}^{[2]} + (X_{21} W_{12}^{[1]} + X_{22} W_{22}^{[1]} + b_2^{[1]})W_{21}^{[2]} + (X_{21} W_{13}^{[1]} + X_{22} W_{23}^{[1]} + b_3^{[1]})W_{31}^{[2]} + b_1^{[2]} \end{smallmatrix}\right) $$
Things are starting to look complicated — this is why we chose a simple case for this exercise.
Now we’ve finished defining our neural network and all of its associated matrices! In the next section, we’ll calculate the loss function, its gradients, and parameter updates.
Calculating The Gradients
The eventual goal is to be able to train our neural network, which means performing gradient descent (i.e., updating our weight and bias parameters with new values). These parameter updates require calculation of the partial derivative of the loss function with respect to each parameter. To calculate these gradients, we will work backwards, starting with the second layer and finishing with the first layer, which allows us to re-use already calculated quantities (i.e., backpropagation technique). This won’t make much of a difference for our shallow network in this exercise, but backpropagation’s efficiency is significant for deep networks.
To calculate the gradients, our strategy will be to work with single elements (scalars) at a time since they are easier to calculate, and then infer the general formula for the gradients from our simple case (see Justin Johnson’s “Backpropagation for a Linear Layer” PDF).
Let’s define the loss function:
$$ L = \frac{1}{2N}\sum_{i=1}^N (\hat{Y}_i - Y_i)^2 $$
We’re treating this exercise as a regression problem (i.e., output is a continuous quantity), so we’ve defined the loss function $L$ to be the mean squared error. $\hat{Y}_i$ represents the model’s output (i.e., prediction) and $Y_i$ represents the truth label for the $i$th training example. Note that the loss function returns a scalar value from summing over all $N$ training examples.
Let’s calculate the derivative of the loss function relative to the output because we’ll be using this quantity frequently in the backward pass:
$$ \frac{\partial L}{\partial \hat{\textbf{Y}}} = \left(\begin{matrix} \partial L / \partial \hat{Y}_1 \\\ \partial L / \partial \hat{Y}_2 \end{matrix}\right) $$
Note that $\frac{\partial L}{\partial \hat{\textbf{Y}}}$ has the same shape (2x1) as $\hat{\textbf{Y}}$ (model’s output) and $\textbf{Y}$ (truth label), as expected. Each $i$th matrix element is:
$$ \frac{\partial L}{\partial \hat{Y}_i} = \frac{1}{N} (\hat{Y}_i - Y_i) $$
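This follows from the chain rule applied to the squared term: only the $i$th term of the sum depends on $\hat{Y}_i$, and the factor of 2 from differentiating the square cancels the 2 in the $\frac{1}{2N}$ prefactor:

$$ \frac{\partial L}{\partial \hat{Y}_i} = \frac{\partial}{\partial \hat{Y}_i} \left[ \frac{1}{2N}\sum_{j=1}^N (\hat{Y}_j - Y_j)^2 \right] = \frac{1}{2N} \cdot 2 (\hat{Y}_i - Y_i) = \frac{1}{N} (\hat{Y}_i - Y_i) $$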
Next, we will start calculating the gradients of the loss function relative to the parameters of each layer. Since we’re working backwards, we’ll start with Layer 2.
Layer 2 Gradients
Recall that Layer 2 has two parameters: $\textbf{W}^{[2]}$ and $\vec{b}^{[2]}$.
Let’s calculate the gradient for the weight:
$$ \frac{\partial L}{\partial \textbf{W}^{[2]}} = \left(\begin{matrix} \partial L / \partial W_{11}^{[2]} \\\ \partial L / \partial W_{21}^{[2]} \\\ \partial L / \partial W_{31}^{[2]} \end{matrix}\right) $$
We know that $\frac{\partial L}{\partial \textbf{W}^{[2]}}$ is a 3x1 matrix because it should be the same size as $\textbf{W}^{[2]}$; the gradient of the loss with respect to a parameter always has the same shape as that parameter.
Let’s work out a single element, which is just a scalar:
$$ \begin{aligned} \frac{\partial L}{\partial W_{11}^{[2]}} &= \frac{\partial L}{\partial \hat{\textbf{Y}}} \frac{\partial \hat{\textbf{Y}}}{\partial W_{11}^{[2]}} \\\ &= \left(\begin{matrix} \partial L / \partial \hat{Y}_1 \\\ \partial L / \partial \hat{Y}_2 \end{matrix}\right) \left(\begin{matrix} X_{11} W_{11}^{[1]} + X_{12} W_{21}^{[1]} + b_1^{[1]} \\ X_{21} W_{11}^{[1]} + X_{22} W_{21}^{[1]} + b_1^{[1]} \end{matrix}\right) \\ &= \frac{\partial L}{\partial \hat{Y}_1} (X_{11} W_{11}^{[1]} + X_{12} W_{21}^{[1]} + b_1^{[1]}) + \frac{\partial L}{\partial \hat{Y}_2} (X_{21} W_{11}^{[1]} + X_{22} W_{21}^{[1]} + b_1^{[1]}) \end{aligned} $$
As you can see, we re-used $\frac{\partial L}{\partial \hat{\textbf{Y}}}$ (which was calculated earlier) and took the derivative of $\hat{\textbf{Y}}$ (also defined earlier) with respect to $W_{11}^{[2]}$. We got those two terms from the chain rule. Since we know that $\frac{\partial L}{\partial W_{11}^{[2]}}$ is a scalar, the multiplication of these two matrices must be a dot product, which is the sum of element-wise products. We’ll be applying this same logic as we solve for the gradients for the other parameters in the neural network.
Let’s repeat this process for the other two elements in the $\frac{\partial L}{\partial \textbf{W}^{[2]}}$ matrix:
$$ \begin{aligned} \frac{\partial L}{\partial W_{21}^{[2]}} &= \frac{\partial L}{\partial \hat{\textbf{Y}}} \frac{\partial \hat{\textbf{Y}}}{\partial W_{21}^{[2]}} \\\ &= \left(\begin{matrix} \partial L / \partial \hat{Y}_1 \\\ \partial L / \partial \hat{Y}_2 \end{matrix}\right) \left(\begin{matrix} X_{11} W_{12}^{[1]} + X_{12} W_{22}^{[1]} + b_2^{[1]} \\ X_{21} W_{12}^{[1]} + X_{22} W_{22}^{[1]} + b_2^{[1]} \end{matrix}\right) \\ &= \frac{\partial L}{\partial \hat{Y}_1} (X_{11} W_{12}^{[1]} + X_{12} W_{22}^{[1]} + b_2^{[1]}) + \frac{\partial L}{\partial \hat{Y}_2} (X_{21} W_{12}^{[1]} + X_{22} W_{22}^{[1]} + b_2^{[1]}) \end{aligned} $$
$$ \begin{aligned} \frac{\partial L}{\partial W_{31}^{[2]}} &= \frac{\partial L}{\partial \hat{\textbf{Y}}} \frac{\partial \hat{\textbf{Y}}}{\partial W_{31}^{[2]}} \\\ &= \left(\begin{matrix} \partial L / \partial \hat{Y}_1 \\\ \partial L / \partial \hat{Y}_2 \end{matrix}\right) \left(\begin{matrix} X_{11} W_{13}^{[1]} + X_{12} W_{23}^{[1]} + b_3^{[1]} \\ X_{21} W_{13}^{[1]} + X_{22} W_{23}^{[1]} + b_3^{[1]} \end{matrix}\right) \\ &= \frac{\partial L}{\partial \hat{Y}_1} (X_{11} W_{13}^{[1]} + X_{12} W_{23}^{[1]} + b_3^{[1]}) + \frac{\partial L}{\partial \hat{Y}_2} (X_{21} W_{13}^{[1]} + X_{22} W_{23}^{[1]} + b_3^{[1]}) \end{aligned} $$
Put it all together:
$$ \frac{\partial L}{\partial \textbf{W}^{[2]}} = \left(\begin{matrix} \frac{\partial L}{\partial \hat{Y}_1} (X_{11} W_{11}^{[1]} + X_{12} W_{21}^{[1]} + b_1^{[1]}) + \frac{\partial L}{\partial \hat{Y}_2} (X_{21} W_{11}^{[1]} + X_{22} W_{21}^{[1]} + b_1^{[1]}) \\\ \frac{\partial L}{\partial \hat{Y}_1} (X_{11} W_{12}^{[1]} + X_{12} W_{22}^{[1]} + b_2^{[1]}) + \frac{\partial L}{\partial \hat{Y}_2} (X_{21} W_{12}^{[1]} + X_{22} W_{22}^{[1]} + b_2^{[1]}) \\\ \frac{\partial L}{\partial \hat{Y}_1} (X_{11} W_{13}^{[1]} + X_{12} W_{23}^{[1]} + b_3^{[1]}) + \frac{\partial L}{\partial \hat{Y}_2} (X_{21} W_{13}^{[1]} + X_{22} W_{23}^{[1]} + b_3^{[1]}) \end{matrix}\right) $$
Examining this matrix, we note that it can be constructed by simply taking the transpose of $\textbf{Y}^{[1]}$ and matrix-multiplying it by $\frac{\partial L}{\partial \hat{\textbf{Y}}}$. I recommend working it out to convince yourself that it is indeed the case. Note that $\textbf{Y}^{[1]^\mathsf{T}}$ is a 3x2 matrix and $\frac{\partial L}{\partial \hat{\textbf{Y}}}$ is a 2x1 matrix, so the result is a 3x1 matrix, as seen above. This strategy where we work out a simple case and then come up with a general form is what we’ll be doing for the other parameters as well.
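To see this concretely, write $Y^{[1]}_{ij}$ as a shorthand for the $(i,j)$ element of $\textbf{Y}^{[1]}$ (i.e., the parenthesized sums above). Then the proposed product expands, row by row, to exactly the matrix we just built element by element:

$$ \textbf{Y}^{[1]^\mathsf{T}} \frac{\partial L}{\partial \hat{\textbf{Y}}} = \left(\begin{matrix} Y^{[1]}_{11} & Y^{[1]}_{21} \\\ Y^{[1]}_{12} & Y^{[1]}_{22} \\\ Y^{[1]}_{13} & Y^{[1]}_{23} \end{matrix}\right) \left(\begin{matrix} \partial L / \partial \hat{Y}_1 \\\ \partial L / \partial \hat{Y}_2 \end{matrix}\right) = \left(\begin{matrix} \frac{\partial L}{\partial \hat{Y}_1} Y^{[1]}_{11} + \frac{\partial L}{\partial \hat{Y}_2} Y^{[1]}_{21} \\\ \frac{\partial L}{\partial \hat{Y}_1} Y^{[1]}_{12} + \frac{\partial L}{\partial \hat{Y}_2} Y^{[1]}_{22} \\\ \frac{\partial L}{\partial \hat{Y}_1} Y^{[1]}_{13} + \frac{\partial L}{\partial \hat{Y}_2} Y^{[1]}_{23} \end{matrix}\right) $$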
Write out the general equation:
$$ \boxed{\frac{\partial L}{\partial \textbf{W}^{[2]}} = \textbf{Y}^{[1]^\mathsf{T}} \frac{\partial L}{\partial \hat{\textbf{Y}}}} $$
Now that we have this generalized equation, we’ll use it to update $\textbf{W}^{[2]}$ during gradient descent. Note that this general equation works for any dimensions of the input data and number of hidden units, even though we derived it using a specific case. Next, we’ll repeat this process for the bias parameter.
Let’s calculate the gradient for the bias:
$$ \frac{\partial L}{\partial \vec{b}^{[2]}} = \left(\begin{matrix} \frac{\partial L}{\partial b^{[2]}_1} \end{matrix}\right) $$
Work on one element at a time (in this case, there is only 1 element):
$$ \begin{aligned} \frac{\partial L}{\partial b^{[2]}_1} &= \frac{\partial L}{\partial \hat{\textbf{Y}}} \frac{\partial \hat{\textbf{Y}}}{\partial b^{[2]}_1} \\ &= \left(\begin{matrix} \partial L / \partial \hat{Y}_1 \\ \partial L / \partial \hat{Y}_2 \end{matrix}\right) \left(\begin{matrix} 1 \\ 1 \end{matrix}\right) \\ &= \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{Y_i}} \end{aligned} $$
Put it all together:
$$ \boxed{\frac{\partial L}{\partial \vec{b}^{[2]}} = \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{Y_i}}} $$
Layer 1 Gradients
As done for Layer 2, we’ll repeat the same process to figure out the general equations for the weight and bias parameters in Layer 1.
Let’s calculate the gradient for the weight:
$$ \frac{\partial L}{\partial \textbf{W}^{[1]}} = \left(\begin{matrix} \partial L / \partial W_{11}^{[1]} & \partial L / \partial W_{12}^{[1]} & \partial L / \partial W_{13}^{[1]} \\ \partial L / \partial W_{21}^{[1]} & \partial L / \partial W_{22}^{[1]} & \partial L / \partial W_{23}^{[1]} \end{matrix}\right) $$
Work on a single element at a time:
$$ \begin{aligned} \frac{\partial L}{\partial W_{11}^{[1]}} &= \frac{\partial L}{\partial \hat{\textbf{Y}}} \frac{\partial \hat{\textbf{Y}}}{\partial W_{11}^{[1]}} \\\ &= \left(\begin{matrix} \partial L / \partial \hat{Y}_1 \\\ \partial L / \partial \hat{Y}_2 \end{matrix}\right) \left(\begin{matrix} X_{11}W_{11}^{[2]} \\ X_{21}W_{11}^{[2]} \end{matrix}\right) \\ &= \frac{\partial L}{\partial \hat{Y_1}} X_{11}W_{11}^{[2]} + \frac{\partial L}{\partial \hat{Y_2}} X_{21}W_{11}^{[2]} \end{aligned} $$
And we can repeat this calculation for all of the other elements in the matrix.
Generalize the equation:
$$ \boxed{\frac{\partial L}{\partial \textbf{W}^{[1]}} = \textbf{X}^\mathsf{T} \frac{\partial L}{\partial \hat{\textbf{Y}}} \textbf{W}^{[2]^\mathsf{T}}} $$
Since $\textbf{X}^\mathsf{T}$ is a 2x2 matrix, $\frac{\partial L}{\partial \hat{\textbf{Y}}}$ is a 2x1 matrix, and $\textbf{W}^{[2]^\mathsf{T}}$ is a 1x3 matrix, the result is a 2x3 matrix, as expected.
Just like before, I recommend working it out to convince yourself that the general equation indeed produces the matrix we expect.
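For example, the $(1,1)$ element of this product is the first row of $\textbf{X}^\mathsf{T}$ times $\frac{\partial L}{\partial \hat{\textbf{Y}}}$ times the first element of $\textbf{W}^{[2]^\mathsf{T}}$, which reproduces the scalar we worked out above:

$$ \left(\begin{matrix} X_{11} & X_{21} \end{matrix}\right) \left(\begin{matrix} \partial L / \partial \hat{Y}_1 \\\ \partial L / \partial \hat{Y}_2 \end{matrix}\right) W_{11}^{[2]} = \frac{\partial L}{\partial \hat{Y}_1} X_{11} W_{11}^{[2]} + \frac{\partial L}{\partial \hat{Y}_2} X_{21} W_{11}^{[2]} = \frac{\partial L}{\partial W_{11}^{[1]}} $$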
Let’s calculate the gradient for the bias:
$$ \frac{\partial L}{\partial \vec{b}^{[1]}} = \left(\begin{matrix} \frac{\partial L}{\partial b^{[1]}_1} & \frac{\partial L}{\partial b^{[1]}_2} & \frac{\partial L}{\partial b^{[1]}_3} \end{matrix}\right) $$
Work out a single element at a time:
$$ \begin{aligned} \frac{\partial L}{\partial b^{[1]}_1} &= \frac{\partial L}{\partial \hat{\textbf{Y}}} \frac{\partial \hat{\textbf{Y}}}{\partial b^{[1]}_1} \\ &= \left(\begin{matrix} \partial L / \partial \hat{Y}_1 \\ \partial L / \partial \hat{Y}_2 \end{matrix}\right) \left(\begin{matrix} W_{11}^{[2]} \\ W_{11}^{[2]} \end{matrix}\right) \\ &= \frac{\partial L}{\partial \hat{Y_1}} W_{11}^{[2]} + \frac{\partial L}{\partial \hat{Y_2}} W_{11}^{[2]} \end{aligned} $$
Generalize the equation:
$$ \boxed{\frac{\partial L}{\partial \vec{b}^{[1]}} = \left( \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{Y_i}} \right) \textbf{W}^{[2]^\mathsf{T}} } $$
Notice that $\frac{\partial L}{\partial \vec{b}^{[1]}}$ re-uses two terms that were previously calculated: the summation term and the $\textbf{W}^{[2]^\mathsf{T}}$ term. Since the summation term results in a scalar and the $\textbf{W}^{[2]^\mathsf{T}}$ term is a 1x3 matrix, the result is a 1x3 matrix, as expected.
Now we’re finally done deriving all of the equations we need! Let’s move on to the implementation part.
Implementing in Python
We’ll be using the NumPy library in Python to perform matrix operations and to generate random numbers for this exercise. I used a Jupyter notebook so that I could easily check the outputs of various variables as I wrote the code.
Here’s the syntax I recommend for the two types of multiplication we’ll be using:
- matrix multiplication: use `@` (example: `matrix1 @ matrix2`)
- element-wise multiplication: use `*` (example: `matrix1 * matrix2`)

I don’t recommend using `np.dot()` for matrix multiplication because that function behaves differently depending on the shapes of its inputs: it can perform an inner (dot) product, a matrix multiplication, or plain scalar multiplication, among other behaviors. I think it’s clearer to use specific notation (such as `@` or `*`) to designate different operations. See NumPy’s np.dot() documentation.
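If the distinction isn't familiar, here's a throwaway example (matrices made up for illustration) showing how the two operators differ:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

print(A @ B)   # matrix product:       [[19. 22.]
               #                        [43. 50.]]
print(A * B)   # element-wise product: [[ 5. 12.]
               #                        [21. 32.]]
```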
Let’s import the modules we’ll need, and define a random number generator `rng`:
import numpy as np
from numpy.random import default_rng
rng = default_rng(seed=42)
Generate random numbers for the input data matrix `X`, weight matrices `W1` and `W2`, and bias vectors `b1` and `b2`, where `1` and `2` denote the layer number:
X = rng.standard_normal((2, 2))
W1 = rng.standard_normal((2, 3))
b1 = rng.standard_normal((1, 3))
W2 = rng.standard_normal((3, 1))
b2 = rng.standard_normal((1, 1))
Y_pred = (X @ W1 + b1) @ W2 + b2
Y = rng.standard_normal((2, 1))
N = len(X)
Y_pred, Y, N
(array([[ 0.39437039],
[-0.17606193]]),
array([[-0.9588826],
[ 0.8784503]]),
2)
You can see that the parameters have been initialized to the same sizes as those in our derivations above. I printed the output of three variables: `Y_pred` is the predicted output from the model (via a forward pass), `Y` represents the truth labels, and `N` is the number of samples.
Calculate the loss function:
loss = np.sum(np.square(Y_pred - Y)) / (2 * N)
loss
0.7358224249837203
This is the same loss function that we used earlier in our derivations: the mean squared error. As expected, the loss function returns a scalar value. During training, our goal will be to reduce this loss value.
Calculate the loss function derivative with respect to the output `dL_dY`, as well as the intermediate output of Layer 1 `Y1_pred`:
dL_dY = (Y_pred - Y) / N
Y1_pred = X @ W1 + b1
dL_dY, Y1_pred
(array([[ 0.67662649],
[-0.52725612]]),
array([[ 0.6137715 , 0.39846854, 0.99213798],
[-0.88220534, -0.21523281, -0.64037434]]))
As we know from working through the math in earlier sections, both of these matrices `dL_dY` and `Y1_pred` will be needed in future gradient calculations, so it’s helpful to calculate them once and re-use them as needed. Note that I’ve been printing the values of matrices as we go along, so that we can check that their shapes are indeed what we expect (according to our earlier equations).
Now it’s time to implement the gradient equations that we derived earlier.
Calculate the gradients for Layer 2:
dL_dW2 = Y1_pred.T @ dL_dY
dL_db2 = np.sum(dL_dY, axis=0, keepdims=True)
dL_dW2, dL_db2
(array([[0.88044222],
[0.38309718],
[1.00894813]]),
array([[0.14937038]]))
Matrix Summation: Whenever you’re summing over a specific axis in a matrix using `np.sum()`, it’s generally a good practice to set `keepdims=True` in order to keep the same tensor rank. Otherwise, your resulting mx1 matrix might actually be a rank-1 tensor with a shape of `(m,)` instead of a rank-2 tensor with a shape of `(m, 1)`. It’s best to keep everything consistent as rank-2 tensors.
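Here's a quick illustration of the difference, using a throwaway 2x1 matrix:

```python
import numpy as np

M = np.array([[1.0],
              [2.0]])                            # shape (2, 1)

print(np.sum(M, axis=0).shape)                   # (1,)   -- rank-1, easy to misuse
print(np.sum(M, axis=0, keepdims=True).shape)    # (1, 1) -- stays a rank-2 matrix
```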
Calculate the gradients for Layer 1:
dL_db1 = dL_db2 * W2.T
dL_dW1 = X.T @ dL_dY @ W2.T
dL_db1, dL_dW1
(array([[ 0.16837644, 0.06983205, -0.12835284]]),
array([[-0.21361259, -0.08859318, 0.16283621],
[-1.35223776, -0.56082388, 1.03080664]]))
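As an optional sanity check that isn't part of the original walkthrough, you can compare one of these analytical gradients against a finite-difference estimate. Here's a minimal sketch for a single element of `W1`, assuming the variables defined above are still in scope (`compute_loss` and `eps` are names I'm introducing just for this check):

```python
# Finite-difference check for dL_dW1[0, 0]; assumes X, W1, b1, W2, b2, Y, N,
# and dL_dW1 are the arrays defined above
def compute_loss(W1_candidate):
    Y_pred = (X @ W1_candidate + b1) @ W2 + b2
    return np.sum(np.square(Y_pred - Y)) / (2 * N)

eps = 1e-6
W1_plus = W1.copy()
W1_plus[0, 0] += eps

numeric = (compute_loss(W1_plus) - compute_loss(W1)) / eps
print(numeric, dL_dW1[0, 0])  # the two values should agree to several decimal places
```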
Everything looks good so far, and the matrix sizes (a common source of error) are all as expected. Now we’re confident enough to set up a loop where we will train our model. During each iteration, we’ll perform a forward pass, calculate the loss and print it to screen, calculate the gradients, and update the parameters according to the gradients. I’ll also define a learning rate `lr` that controls how big of a step is taken during gradient descent. Everything in this loop should be familiar because we just walked through each step and saw its output.
Run the training loop:
epochs = 100
lr = 0.01
for epoch in range(epochs):
    # Forward pass
    Y_pred = (X @ W1 + b1) @ W2 + b2
    # Loss calculation
    loss = np.sum(np.square(Y_pred - Y)) / (2 * N)
    # Print progress
    print(epoch, loss)
    # Gradient calculations
    dL_dY = (Y_pred - Y) / N
    Y1_pred = X @ W1 + b1
    dL_dW2 = Y1_pred.T @ dL_dY
    dL_db2 = np.sum(dL_dY, axis=0, keepdims=True)
    dL_db1 = dL_db2 * W2.T
    dL_dW1 = X.T @ dL_dY @ W2.T
    # Update parameters
    W1 -= lr * dL_dW1
    b1 -= lr * dL_db1
    W2 -= lr * dL_dW2
    b2 -= lr * dL_db2
0 0.7358224249837203
1 0.6838444281365352
2 0.6356866069048248
3 0.5910284773849694
4 0.5495834266636437
5 0.5110941899865947
...
95 0.000445617850105421
96 0.0004105173016214235
97 0.00037817297218543515
98 0.00034836942798990937
99 0.0003209079596407881
Success! Our model appears to be training well, with the loss decreasing over time.
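As one last check (not shown in the original output), you can run a final forward pass with the trained parameters and compare the predictions to the labels; given the tiny final loss, they should now be close:

```python
Y_pred_final = (X @ W1 + b1) @ W2 + b2
print(Y_pred_final)  # should now be close to Y
print(Y)
```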
To summarize, we’ve shown how to define a forward pass and derived equations for loss function gradients for a simple 2-layer neural network. We’ve implemented these equations in Python using the NumPy library, and successfully trained our model. Hopefully, you’ve found this to be a useful exercise that gave you a deeper intuition of how neural networks work!