Neural Networks: Solving Problems, the Human Brain Way

Part 2/3: Introduction to AI, ML and now, Deep Learning

Rania Hashim
8 min read · Dec 20, 2022


In the previous article, I briefly touched on the many wonders of the way our brain works, including how effortlessly it can read handwritten digits:

Computers see the same digits as a weird amalgamation of pixels. What makes this worse is that each number can be written in many different ways, so simply storing and matching pixel patterns won’t be of much use.

Tasks that required complex thinking were previously considered beyond AI’s reach, as they seemed to require a human thought process.

We maintained the status quo until someone thought, “Hey, what if we mimic the thought process of a human brain?”.

And just like that, the idea of artificial neural networks was born.

Our brain has billions of neurons that process and transmit information by firing electrical signals, passing messages to one another across synapses via neurotransmitters. Each neuron on its own is a simple processor of information: terribly useless and largely unimpressive. As a network, however, the neurons form a complex system that is the key to functions such as memory and learning.

The key principle behind a neural network is to emulate this to achieve similar results. We seek inspiration from our biology to build better AI/ML techniques.

Just like our brain, neural networks contain artificial neurons, or nodes, which are simple units that store and process information. These neurons are organized into layers, and the connected layers form a network (hence the name ‘neural network’).

This is an artificial neural network:

From 3blue1brown

This neural network is used to identify images of handwritten digits as the numbers they represent. It has 4 layers: the 1st is called the input layer, the 2nd and 3rd are the hidden layers, and the 4th is the output layer.

Our input is going to look something like this:

a 28x28 pixel image of the number 7 from the MNIST database

This is a 28x28 pixel image of a number. The objective of the neural network is to correctly identify this as 7.

Okay, so we feed it into the neural network. Now what?

The first layer contains 784 neurons, one for each pixel in the 28x28 image. Each neuron holds a value between 0 and 1 representing the grayscale value of its pixel. This value is called its activation.

Flash forward to the last layer and we find that it contains 10 neurons, each representing one of the digits 0–9. The activations of the neurons in this layer represent the network’s confidence in each output (i.e. how much it believes that the given image corresponds to that digit).
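To make that input layer concrete, here’s a minimal sketch (using numpy, with a randomly generated stand-in for a real MNIST image) of how a 28x28 image becomes 784 activations:

```python
import numpy as np

# A stand-in for a real 28x28 grayscale image, pixel values 0-255
image = np.random.randint(0, 256, size=(28, 28))

# Flatten it into 784 input activations, scaled into the range 0 to 1
input_activations = image.flatten() / 255.0
print(input_activations.shape)  # (784,)

# The output layer holds 10 activations, one per digit (0-9); the
# network's answer is the digit whose neuron has the highest activation
```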

The hidden layers are really what work the magic.

Setting things off: Getting from Input to Output

Going back to the brain, we know that the firing of certain neurons brings about the firing of certain other neurons. Similarly, the pattern of activations of the first layer will cause a specific pattern of activations on the second layer and so on.

The activations on the second layer are going to depend on 3 things:

  • activations of the first layer
  • weights
  • bias

Weights

While the neurons in adjacent layers are connected to each other, the strength of these connections varies. Each connection is assigned a ‘weight’. The value of the weight determines the priority or importance we give to the previous node.

For instance, when you are choosing a location to host a Christmas party for you and your friends, you might look for three things: price, availability of free brunch and the activities offered. Your main priority might be price, so we assign the highest weight to price. At the same time, you may not care much about free brunch (imagine not caring about free food 🗿), so we would assign the lowest weight to that.

With the help of the above example, it becomes clear that weights essentially tell us how much each connection matters.

Hence, to compute the activation of a neuron in the second layer, we first take the activations of the first layer together with the weights of their connections, and calculate the weighted sum (i.e. Σwₙaₙ).
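As a rough sketch of that calculation (all the numbers here are made up):

```python
import numpy as np

# Made-up activations of three first-layer neurons...
a = np.array([0.0, 0.8, 0.2])
# ...and the weights of their connections to one second-layer neuron
w = np.array([0.9, 0.1, 0.4])

weighted_sum = np.dot(w, a)  # Σwₙaₙ = 0.9*0.0 + 0.1*0.8 + 0.4*0.2
print(weighted_sum)          # 0.16
```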

Bias

You may have heard the term ‘bias’ a lot, especially in discussions of social issues. In neural networks, the bias is simply a constant that we add to the weighted sum we found earlier.

We add this because we don’t want the neuron to light up for just anything: the bias acts as a threshold the weighted sum has to clear before the neuron activates meaningfully. In a way, it is an indication of whether that neuron tends to be active or inactive.

As mentioned earlier, activations can only take values between 0 and 1. The weighted sum of all the activations plus the bias, however, can be any number at all.

What we want to do here is to “compress” that value into the range of 0 to 1. To do so, we may use functions like the sigmoid function (which squashes any number into exactly that range) or ReLU (which simply clips negative values to 0).
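Here’s what those two squashing functions look like in code (a quick numpy sketch):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Clips negative values to 0 and leaves positive values unchanged
    return np.maximum(0.0, x)

print(sigmoid(4.2))  # ~0.985: large sums saturate towards 1
print(relu(-3.0))    # 0.0: negative sums are silenced
```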

Our final activation will then be: a = σ(w₁a₁ + w₂a₂ + … + wₙaₙ + b), where σ stands for whichever squashing function we chose.

Something important to note here is that initially, the weights and biases take on random values.

As you can infer from the equation, the weights and biases are essentially the dials and knobs that control the final output. We can tweak what the neuron picks up on by adjusting the values of these two.

This process repeats layer by layer until the output layer, where the neuron with the highest activation represents the final output.
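Putting the whole forward pass together, here’s a compressed sketch (numpy assumed; the layer sizes match the 784-16-16-10 network pictured above, and since the weights and biases are random, the output is a pure guess):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes: 784 inputs -> two hidden layers of 16 -> 10 outputs
sizes = [784, 16, 16, 10]

# Weights and biases start out random (i.e. untrained)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

def forward(a):
    # Each layer's activations feed the next: a' = sigmoid(Wa + b)
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

a0 = rng.random(784)    # stand-in for a flattened image
output = forward(a0)
print(output.argmax())  # the digit this (untrained) network guesses
```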

That’s not to say that the final output will be right though…

How Neural Networks Learn

That’s right: while the neural network above will happily produce an output, with random weights and biases it won’t exactly be the right one. We need to train the network using training data so that it can correctly identify the images.

We feed the neural network thousands of images along with their labels (the number each one corresponds to), and based on that feedback, it adjusts its weights and biases.

Next, we feed the network a bunch of other data that it’s never seen before and see how accurately it classifies them.

To evaluate how well the model works, we define a cost function.

A cost function is essentially a way of saying, “you suck, you goofy ahh machine”. It is an evaluation of the performance of the model, and changes in its value tell us a lot about how it has improved.

There are many ways of defining the cost function, but one of the most commonly used is Mean Squared Error (MSE). In this method, we compare the actual activation values of the output layer with the correct/expected activation values.

C(w, b) = Σ(actual activation − expected activation)² / n
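In code, that might look like this (the output activations below are invented for illustration):

```python
import numpy as np

def mse_cost(actual, expected):
    # Mean squared error over the output activations
    return np.mean((actual - expected) ** 2)

# For an image of a 7, the expected output is 1.0 on the "7" neuron
# and 0.0 everywhere else
expected = np.zeros(10)
expected[7] = 1.0

# A made-up (and fairly bad) actual output from the network
actual = np.array([0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.0, 0.6, 0.1, 0.2])

print(mse_cost(actual, expected))  # 0.036: the lower, the better
```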

We want to find the weights and biases that make the cost function as small as possible, as that would mean the network is pretty efficient and does what it is meant to do.

Just knowing how much of a crappy job our neural network is doing won’t change anything though. We need to help the network learn.

We can start by imagining C as a function of 2 variables.

Source: neuralnetworksanddeeplearning.com

What we’d want to find here is where C achieves its global minimum. That may seem easy in this case, but keep in mind that when we apply this to a real neural network, we may be dealing with many thousands of weights and biases.

One way to find the minimum is to picture our function as a valley. Imagine a ball rolling down the slope: it would eventually reach the bottom of the valley, and that bottom would be our minimum.

We want to know which direction the ball should roll in to decrease the function the fastest. For this, we consider the negative of the gradient of the function (the gradient of a function gives us the direction of steepest ascent, so its negative gives us the steepest descent).

In that case, we’d compute this direction, take a small step downhill, and repeat. This is known as gradient descent.
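Here’s the idea on a toy one-variable cost function, C(w) = (w - 3)², whose minimum we already know sits at w = 3:

```python
# Minimizing C(w) = (w - 3)^2 with gradient descent.
# The gradient dC/dw = 2(w - 3) points uphill; we step the opposite way.

w = 10.0            # a random starting point
learning_rate = 0.1

for step in range(50):
    grad = 2 * (w - 3)         # direction of steepest ascent
    w -= learning_rate * grad  # take a small step downhill

print(w)  # ~3.0: the ball has rolled to the bottom of the valley
```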

Note: The training data is divided into mini-batches, and each step is taken with respect to a single mini-batch (this variant is known as stochastic gradient descent). We do this because computing the gradient over the entire training set for every single step isn’t practical.
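A sketch of how that mini-batch loop might be organized (the dataset and batch sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

n_examples = 60_000  # e.g. the size of the MNIST training set
batch_size = 32      # an arbitrary mini-batch size

# Shuffle once, then walk through the data one mini-batch at a time
indices = rng.permutation(n_examples)
for start in range(0, n_examples, batch_size):
    batch = indices[start:start + batch_size]
    # compute the gradient on just these examples and take one
    # gradient-descent step with it (the actual step is omitted here)
```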

When we look at the components of the gradient, we notice 2 things: their signs and their relative magnitudes. The sign of each component tells us whether the corresponding weight or bias should increase or decrease, while the relative magnitudes indicate which changes matter more.

Backpropagation is the algorithm used to compute this gradient efficiently, allowing gradient descent to fine-tune the weights and biases towards a local minimum of the cost function.

As the name suggests, here, we work backwards. We start with the output layer and look at each of the nodes. We keep track of the gaps between expected value and actual value — noting down the adjustments we want to see in the output node. This would essentially translate to adjustments in the weights, biases and previous-layer activations, giving us a list of changes we require in the (n-1)th layer.

We repeat the same process moving backwards, from the 3rd layer to the 2nd. This is done for multiple training examples, and the averaged result is proportional to the (negative) gradient of the cost function.
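Here is a heavily compressed sketch of a single backpropagation step on a toy 2-3-2 network (numpy assumed; a real implementation would average these gradients over a mini-batch, as described above):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)

# A tiny 2 -> 3 -> 2 network, randomly initialized
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)

x = rng.random(2)         # one training input
y = np.array([0.0, 1.0])  # its expected output

# Forward pass, keeping the intermediate values for later
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

# Backward pass: start at the output layer and work backwards.
# (a2 - y) is the gap between actual and expected output; the constant
# factor from differentiating the square is folded into the learning rate.
delta2 = (a2 - y) * sigmoid_prime(z2)
grad_W2 = np.outer(delta2, a1)
grad_b2 = delta2

# Translate those desired changes into the previous layer
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
grad_W1 = np.outer(delta1, x)
grad_b1 = delta1

# One gradient-descent step with the resulting gradients
lr = 0.5
W2 -= lr * grad_W2; b2 -= lr * grad_b2
W1 -= lr * grad_W1; b1 -= lr * grad_b1
```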

After the changes are implemented, there you go! You got yourself a working neural network that accurately classifies images to their corresponding numbers!

Hey 👋, I’m Rania, a 16 y/o activator at the Knowledge Society. I’m a future of food researcher who focuses on acellular agriculture. Currently, I’m nerding out on using artificial intelligence for education. I’m always ready to learn, grow and inspire. I’d love to connect; reach out to me on any of my social media and let’s be friends!
