The “Hello World” of Computer Vision

Part 3/3: Building a neural network for the MNIST dataset to develop intuitive understanding

Rania Hashim
8 min read · Mar 13, 2023

In the last article, we learnt how a neural network works and the essential mathematical operations that go into spitting out an accurate outcome. Now, it's time to get some of that action and build one yourself!

Our Dataset for Today: MNIST Dataset

You might have noticed me throwing around the handwritten digits example in the previous parts of this series. These digits are actually from the MNIST dataset, a large database of low-resolution handwritten digits that is widely used to train AI systems, as shown below.

Our main objective in building this 2-layer neural network is to classify each handwritten digit as the number it represents. Success in this project can be defined as 80%+ accuracy in digit classification.

In order to build this neural network, I used Samson Zhang’s YouTube tutorial for guidance — would highly recommend!

Alright, now let's get started!

The Code

As you might have guessed, we will be using Python for this project.

We start by making necessary imports — the packages (numpy, pandas, pyplot) and the data which we are going to be training our neural network on:

import numpy as np #for linear algebra and array operations
import pandas as pd #for reading and handling the CSV data
from matplotlib import pyplot as plt #for showing visuals

data = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')

As aforementioned, the data imported is basically the handwritten digits. Each digit is a low-resolution 28 x 28 (784-pixel) image, and the greyscale value of each pixel is on a scale of 0–255.
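
If you'd like to see what one of these digits actually looks like, here's a quick snippet (my own addition, not from the tutorial) that plots the first image in the dataset, assuming the data has been loaded as shown above:

#a quick peek at one digit (my own addition) - run this after the read_csv line above
sample = data.iloc[0] #first row of the DataFrame
label = sample.iloc[0] #the first column holds the digit's label
pixels = sample.iloc[1:].to_numpy().reshape(28, 28) #the remaining 784 columns are the pixel values
plt.imshow(pixels, cmap='gray')
plt.title("Label: " + str(label))
plt.show()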

Quick heads-up before we move further: I explain some of the code using comments within the code blocks, so make sure to check those out!

Organizing our Data

Now that the data has been imported, we have to do some work on it before we can start using our fancy schmancy linear algebra on it.

#over here we're basically organizing the data 

data = np.array(data) #organizing it as an array of data
m, n = data.shape #m is the number of rows (i.e. examples), n is the number of columns (i.e. number of pixels + 1) in the matrix
np.random.shuffle(data) #data is randomly shuffled before being split into training and dev data

#we organize the data into 2 sets - training set and dev set. this is so we can detect OVERFITTING.

data_dev = data[0:1000].T #first 1000 examples, transpose of the matrix taken for convenience
Y_dev = data_dev[0]
X_dev = data_dev[1:n]
X_dev = X_dev / 255. #scale the 0-255 pixel values down to 0-1 so the activations stay manageable

data_train = data[1000:m].T #remaining examples, transpose of the matrix taken for convenience
Y_train = data_train[0]
X_train = data_train[1:n]
X_train = X_train / 255. #same scaling for the training set

The data is organized into 2 sets so that we can detect overfitting. Overfitting is basically when the model gives accurate predictions for the training data but not for unseen data. This essentially defeats the purpose of the model, rendering it unable to make accurate predictions for new data. Hence, we shuffle the data and then split it into 2 sets: the training set is what the network learns from, and the dev set is held back so we can check how it performs on data it has never seen.
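
If you want to double-check the split, a quick shape check (my own addition) helps confirm that each column now holds one example:

#sanity check (not part of the original code)
print(X_train.shape, Y_train.shape) #expect (784, m - 1000) and (m - 1000,)
print(X_dev.shape, Y_dev.shape) #expect (784, 1000) and (1000,)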

Working the Parameters and Defining some Functions

def init_params():
    W1 = np.random.rand(10, 784) - 0.5 #random values in the given shape, shifted to lie between -0.5 and 0.5
    b1 = np.random.rand(10, 1) - 0.5
    W2 = np.random.rand(10, 10) - 0.5 #values for the second layer
    b2 = np.random.rand(10, 1) - 0.5
    return W1, b1, W2, b2

Here, we define the weights and biases for the neural network. If you recall from the previous article, these are essentially the dials and knobs that make your neural network what it is. We start out with small random values centred around zero, which is why we use the np.random.rand() function and subtract 0.5.

These values will take shape and become increasingly accurate as the network receives feedback from the training data.
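
To see how those shapes line up with the network's architecture, here's a quick check (my own addition):

#sanity check (not part of the original code): the shapes show how the layers connect
W1, b1, W2, b2 = init_params()
print(W1.shape, b1.shape) #(10, 784) and (10, 1) - 784 input pixels feeding 10 hidden units
print(W2.shape, b2.shape) #(10, 10) and (10, 1) - 10 hidden units feeding 10 output digits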

At this stage, you will also want to define activation functions. Here, we will be using ReLU and softmax.

ReLU (Rectified Linear Unit) is an activation function that passes positive inputs through unchanged and outputs 0 for negative inputs, as shown in the graph above.

def ReLU(Z):
    return np.maximum(Z, 0)
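
To make that concrete, here's a tiny demo (my own addition) of what ReLU does to a small array:

#quick demo (not part of the original code)
print(ReLU(np.array([-2.0, -0.5, 0.0, 3.0]))) #prints [0. 0. 0. 3.] - negatives become 0, positives pass through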

Softmax is another activation function and is defined as shown below. This will be used in the second layer of the network, unlike ReLU which will be used in the first.

def softmax(Z):
    A = np.exp(Z) / sum(np.exp(Z)) #each element divided by sum for each column across all of the rows
    return A

This is essentially the code for the softmax formula, softmax(z_i) = e^(z_i) / Σ_j e^(z_j), which turns the raw outputs into values between 0 and 1 that sum to 1.
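
To see that in action, here's a small demo (my own addition) showing that the outputs behave like probabilities:

#quick demo (not part of the original code): the outputs are positive and sum to 1
print(softmax(np.array([[1.0], [2.0], [3.0]]))) #roughly [[0.09], [0.24], [0.67]]
print(softmax(np.array([[1.0], [2.0], [3.0]])).sum()) #1.0, up to floating-point rounding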

Forward Propagation

Now that you have your parameters, it's time to initiate forward propagation!

def forward_prop(W1, b1, W2, b2, X): #where X is the values of the input layer on which the defined weights and biases will be applied
    Z1 = W1.dot(X) + b1 #.dot is used for matrix multiplication as both are numpy arrays
    A1 = ReLU(Z1)
    Z2 = W2.dot(A1) + b2
    A2 = softmax(Z2)
    return Z1, A1, Z2, A2

Here, X is going to be the values of the input layer. What we want to calculate first is Z1, the values taken on by the next layer (the first layer after the input layer).

We do this by multiplying the weights with the values of the input layer (W1.dot(X)) and adding in our bias.

We then apply an activation function to get A1 and repeat the same steps for the next layer.
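
If you want to confirm everything is wired up correctly at this stage, here's a quick sanity check (my own addition) that runs a single forward pass on the training set with freshly initialized parameters:

#sanity check (not part of the original code)
W1, b1, W2, b2 = init_params()
Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X_train)
print(A2.shape) #(10, number of training examples) - one column of 10 probabilities per image
print(A2[:, 0].sum()) #each column sums to 1, up to floating-point rounding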

One-Hot Encoding

Great, so now we’re done with our forward propagation. What’s next?

This is where one-hot encoding comes in. We want to encode the correct label for each example as an array.

def one_hot(Y):
    one_hot_Y = np.zeros((Y.size, Y.max() + 1)) #creates a correctly sized matrix
    one_hot_Y[np.arange(Y.size), Y] = 1 #the right label/cell takes on the value of 1
    one_hot_Y = one_hot_Y.T
    return one_hot_Y

What one-hot encoding does is convert a categorical variable into a binary vector. For instance, take a look at the below image:

Source: Towards Data Science

Initially, each id belonged to a category based on its color. What one-hot encoding does is express this in a different way (vector form). If the color matches the label, it is assigned a value of 1. If it doesn't, it is assigned a 0. This is done not just for one colour, but for every colour in the dataset.

Similarly, in our case, we want to encode each handwritten digit's label as a one-hot vector.
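
Here's a tiny demo (my own addition) of what the one_hot function above produces; each label becomes a column with a single 1 in the row for that digit:

#quick demo (not part of the original code)
print(one_hot(np.array([5, 0, 9])))
#prints a 10 x 3 matrix: column 0 has a 1 in row 5, column 1 in row 0, column 2 in row 9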

Backward Propagation

Initially, we took random values for the weights and biases. Now, it's time to work backwards and correct them.

def backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y):
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y #activations of the second layer minus the expected values
    dW2 = 1 / m * dZ2.dot(A1.T) #over here, we are trying to find out by how much the initial weights and biases were off so that we can nudge them by the right amounts
    db2 = 1 / m * np.sum(dZ2)
    dZ1 = W2.T.dot(dZ2) * ReLU_deriv(Z1) #how much the first layer was off by, taking the error and applying the weights in reverse
    dW1 = 1 / m * dZ1.dot(X.T)
    db1 = 1 / m * np.sum(dZ1)
    return dW1, db1, dW2, db2

Here, we work backwards from the output layer and compute the deviations from the expected values. These deviations are what we use to nudge the weights and biases. In the case of the first layer, we also need to account for the ReLU activation by multiplying by its derivative, which is 1 for positive inputs and 0 otherwise. This is why we define the ReLU_deriv function:

def ReLU_deriv(Z):
    return Z > 0 #True (i.e. 1) where Z is positive, False (i.e. 0) otherwise

#note: make sure this function is defined before backward_prop is called
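
As a quick illustration (my own addition), ReLU_deriv just produces a mask of which inputs were positive; True and False behave like 1 and 0 when multiplied:

#quick demo (not part of the original code)
print(ReLU_deriv(np.array([-2.0, 0.0, 3.0]))) #[False False True]
print(ReLU_deriv(np.array([-2.0, 0.0, 3.0])) * np.array([10.0, 10.0, 10.0])) #[0. 0. 10.]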

We update the parameters by using the method shown below:

def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    W1 = W1 - alpha * dW1 #alpha is the learning rate set by us
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2

The Final Verdict: What is the Number?

Now, this is not a linear process. We do not just do forward propagation, backward propagation and update our parameters once and call it a day. We repeat the process to improve its accuracy until our weights and biases look like more than just random guesses.

Here is where we define gradient descent:

def get_predictions(A2):
    return np.argmax(A2, 0) #index of the largest value in the output layer (i.e. the predicted digit)

def get_accuracy(predictions, Y):
    print(predictions, Y)
    return np.sum(predictions == Y) / Y.size

def gradient_descent(X, Y, alpha, iterations):
    W1, b1, W2, b2 = init_params() #returns the initial values
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha) #updating the weights and biases
        if i % 10 == 0: #every 10th iteration
            print("Iteration: ", i)
            predictions = get_predictions(A2) #predictions from forward prop, get accuracy
            print(get_accuracy(predictions, Y))
    return W1, b1, W2, b2

W1, b1, W2, b2 = gradient_descent(X_train, Y_train, 0.10, 500) #with a learning rate of 0.10 and 500 iterations

Finally, we run the training loop on the training data. To see whether the network also handles examples it has never seen, you can make predictions on the dev set we held out at the start.
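
Here's a short sketch of that check (my own addition), reusing the functions we defined earlier:

#evaluate the trained weights and biases on the held-out dev set
_, _, _, A2_dev = forward_prop(W1, b1, W2, b2, X_dev)
dev_predictions = get_predictions(A2_dev)
print(get_accuracy(dev_predictions, Y_dev))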

The only thing remaining is to click run and boom, you've got yourself a neural network with 80%+ accuracy! To be specific, mine was 82% accurate, which I would consider pretty good 🦾

my final accuracy (82%)

You can play around with your neural network; try tweaking some of the values and see how much of an impact it has on your final accuracy. Feel free to experiment and learn more :)
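
For example, you could try a smaller learning rate with more iterations (just one possible experiment of my own, not a recommendation from the tutorial):

#one possible experiment: a slower learning rate with more iterations
W1, b1, W2, b2 = gradient_descent(X_train, Y_train, 0.05, 1000)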

Anyways, that brings us to the end of this series; thanks for sticking around and I hope you learnt something new!

Hey 👋, I’m Rania, a 16 y/o activator at the Knowledge Society. I’m a future of food researcher who focuses on acellular agriculture. Currently, I’m nerding out on using artificial intelligence for education. I’m always ready to learn, grow and inspire. I’d love to connect; reach out to me on any of my social media and let’s be friends!
