Machine Learning

Neural Networks Cover Art

A detailed beginner's guide to Machine Learning

Reinventing the Wheel

Building things from scratch teaches you a lot of 'under the hood' details that you would otherwise have missed. These little details are invaluable for a deeper understanding. Here's what I learnt by implementing an entire Neural Network model, using only Python & its library NumPy.

August 2023

THE PROJECT

Using only Python & NumPy, I've implemented a Neural Network and trained it to recognize handwritten digits from the MNIST dataset. It reached 94.8% accuracy within 4 minutes of training.

Another Neural Network is trained to recognize English letters, from a custom dataset with 48 typeface variations, 2 size variations per typeface, and 27 uppercase & lowercase English letters. It reached 75.5% accuracy within 30 minutes of training.

THE ARTICLE

Many resources demand a strong Calculus background for getting started with Machine Learning. While it's certainly helpful, it's not a strict requirement in my opinion. The key concepts of Calculus used in Machine Learning can be explained without a deep Calculus background.

With this Case Study, I aim to explain what I learnt in an effective & simple manner.

Breakdown

Understanding Neural Networks

Before we get into the details, let's define some fundamental concepts from Machine Learning.

FUNCTIONS

A function can be described as an operation applied to an input. The result of the operation is the output of the function.

Everything can be described as a function: the images on your screen, the sounds you hear, a Fourier transform, plain multiplication, and more.

Polynomial as a Function
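For instance, that polynomial idea maps directly onto code; here's a small made-up example in Python:

```python
def polynomial(x):
    # f(x) = 3x^2 + 2x + 1 -- one operation applied to an input
    return 3 * x**2 + 2 * x + 1

print(polynomial(2))  # 17
```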

NEURAL NETWORKS

Neural networks are composed of interconnected neurons (also known as perceptrons). These neurons are the foundation of machine learning. The more neurons a network has, the more complexity it can learn (I'll expand on this later).

Sample Neural Network

THE NEURON

A neuron can basically be defined as a function that takes multiple input values and produces a single output. Each connection to the neuron has an associated weight, which determines its importance (its influence on the output). Additionally, each neuron has a bias value, which shifts the output. The result is then passed through an activation function, which yields the neuron's final output.

Neuron as a Function

Common activation functions are Sigmoid, Tanh, and ReLU. These will be explained in detail later. For now, you can think of the activation function as the neuron's sensitivity.
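To make this concrete, here's a minimal NumPy sketch of a single neuron (the inputs, weights & bias values are made up purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def neuron(inputs, weights, bias):
    # weighted sum of inputs, shifted by the bias,
    # then squashed by the activation function
    return sigmoid(np.dot(weights, inputs) + bias)

inputs  = np.array([0.5, -1.0, 2.0])   # example inputs (made up)
weights = np.array([0.8,  0.2, -0.5])  # one weight per connection
bias    = 0.1
print(neuron(inputs, weights, bias))   # a single output value
```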

Function Approximators

Neural Networks are essentially function approximators. Given a set of sample data (inputs & outputs), and without knowing the definition of the underlying function, you can train a sufficiently large neural network to approximate it. Different types of functions require different structures of neural networks. A linear function can be learnt by a single-layer neural network, but such a network cannot learn a non-linear function, because it only performs linear calculations. The best example of this phenomenon is the pair of OR and XOR functions.

A single layer neural network can easily learn the OR function.

Neural Network & OR Function

However, the same network won't be able to learn the XOR function, since it's not linearly separable. A network with 2 layers can learn it, since the additional layer (through its activation function) introduces non-linearity into the network.

Neural Network & XOR Function

Activation functions

Activation functions are a core part of neural networks. As explained earlier, non-linearity is important in a network so that it can approximate functions of increased complexity. Activation functions are the standard way to add non-linearity to a neural network. The most commonly used activation functions are:

Common Activation Functions

Graph of Activation Functions
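For reference, here's how these three activations are typically written in NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))   # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)             # squashes values into (-1, 1)

def relu(x):
    return np.maximum(0, x)       # zero for negatives, identity for positives
```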

SOFTMAX FUNCTION

Softmax is a unique activation function. It converts a vector of continuous values into a probability distribution, which is why it is used in the output layer of classification Neural Networks (and in Transformers) to make predictions.
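Here's a minimal, numerically stable sketch of Softmax (subtracting the maximum before exponentiating avoids overflow):

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x)        # for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)     # values are positive and sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10]
```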

Optimizers

The Learning in Machine Learning

We can finally explore how a neural network actually learns. Consider a single-layer neural network with one input, which can be defined mathematically as F(x) = wx + b.

Neural Network as a Function

Given a list of sample inputs & their corresponding outputs (the dataset), without knowing the function, our goal is to tweak the parameters weight (w) and bias (b) such that the function's output matches the given outputs. In this simple example, we could treat the parameters as unknowns and solve the equations by hand. But we need a systematic approach that scales well to larger neural networks.

Instead, let's use a state space, which is a fancy term for the set of all possible combinations of weights & biases, plotted for analysis. Plotting the weight on the X-axis and the bias on the Y-axis, we get the following graph:

Empty State Space

Now, we can fill the graph with a quantity known as loss. It can be defined as the difference between the network's output and the desired output; usually Mean Squared Error is used as the loss function. If we plot the loss for each combination of weight & bias, we get the following state space:

In this 3D representation of all possibilities, we can see a landscape with a distinct "valley" in blue. This valley marks the parameter values that yield the least error; with that combination, we get the best approximation. We cannot use this analysis technique for larger networks, especially multi-layered ones, because we would need a hyper-dimensional space to visualize them. The question then becomes: how do we express this search mathematically?
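As a small illustration (with a made-up dataset), here's how such a loss landscape can be computed: evaluate the Mean Squared Error of F(x) = wx + b over a grid of weight & bias values:

```python
import numpy as np

# a made-up dataset generated by a "hidden" function y = 2x + 1
xs = np.linspace(-1, 1, 50)
ys = 2 * xs + 1

# grid of candidate weights and biases (the state space)
ws = np.linspace(-4, 4, 200)
bs = np.linspace(-4, 4, 200)
W, B = np.meshgrid(ws, bs)

# Mean Squared Error of F(x) = wx + b for every (w, b) pair
loss = np.mean((W[..., None] * xs + B[..., None] - ys) ** 2, axis=-1)

# the lowest point of this landscape is the best approximation
i, j = np.unravel_index(np.argmin(loss), loss.shape)
print(W[i, j], B[i, j])  # close to 2 and 1
```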

That's where Optimizers come in. These are algorithms that find the optimal parameter values, the ones that yield the best approximation.

Stochastic Gradient Descent

The most common optimizer is Stochastic Gradient Descent (or SGD). Instead of analyzing all possibilities (the entire state space), SGD only calculates the slope at the current position. This slope is given by the derivative of the loss with respect to the parameters. The derivative of a function is basically its slope; it tells us how the output will change with respect to a change in the inputs. SGD then takes a small step in the direction that decreases the loss.

Neural Network Function's Derivative
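Here's a minimal sketch of SGD on the single-neuron function F(x) = wx + b with a Mean Squared Error loss (the dataset and learning rate are made up for illustration):

```python
import numpy as np

# made-up dataset from a hidden function y = 2x + 1
xs = np.linspace(-1, 1, 50)
ys = 2 * xs + 1

w, b = 0.0, 0.0          # start from arbitrary parameters
lr = 0.1                 # learning rate (step size)

for step in range(500):
    pred = w * xs + b                    # forward pass: F(x) = wx + b
    error = pred - ys
    # gradients of the MSE loss with respect to w and b
    grad_w = np.mean(2 * error * xs)
    grad_b = np.mean(2 * error)
    # step downhill, against the gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # approaches 2 and 1
```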

SGD Approximating an Image

Adam Optimizer

The Adam optimizer is an advanced optimization algorithm which improves upon SGD by adopting momentum: instead of stepping by exactly the current gradient each time, it accumulates momentum from previous iterations (and scales the step size per parameter). Adam is among the fastest commonly used optimizers.

Adam Approximating an Image
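For reference, here's a sketch of the standard Adam update rule applied to the same toy w & b problem (0.9, 0.999 and 1e-8 are the commonly used defaults; the learning rate is picked arbitrarily for this toy):

```python
import numpy as np

xs = np.linspace(-1, 1, 50)
ys = 2 * xs + 1                      # made-up dataset again

w, b = 0.0, 0.0
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
m = np.zeros(2)                      # first moment (momentum)
v = np.zeros(2)                      # second moment (per-parameter scaling)

for t in range(1, 2001):
    error = w * xs + b - ys
    g = np.array([np.mean(2 * error * xs), np.mean(2 * error)])
    m = beta1 * m + (1 - beta1) * g          # accumulate momentum
    v = beta2 * v + (1 - beta2) * g**2       # accumulate squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat[0] / (np.sqrt(v_hat[0]) + eps)
    b -= lr * m_hat[1] / (np.sqrt(v_hat[1]) + eps)

print(w, b)  # approaches 2 and 1
```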

Evolution

Another, but relatively uncommon, technique is the evolutionary algorithm, inspired by biological evolution. In each iteration, multiple copies of the current function are created; together they are called a generation. Each copy is then mutated, meaning its parameters are randomly tweaked. The copy with the minimal loss is the best of the generation, and it is selected as the starting point for the next iteration.

Evolutionary Optimizer
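Here's a minimal sketch of this loop on the same toy w & b problem (the population size and mutation scale are arbitrary choices for illustration):

```python
import numpy as np

xs = np.linspace(-1, 1, 50)
ys = 2 * xs + 1                              # made-up dataset

def loss(params):
    w, b = params
    return np.mean((w * xs + b - ys) ** 2)

rng = np.random.default_rng(0)
best = np.zeros(2)                           # start with w = 0, b = 0

for generation in range(200):
    # create a generation of mutated copies of the current best
    population = best + rng.normal(scale=0.1, size=(20, 2))
    losses = [loss(p) for p in population]
    # keep the copy with the minimal loss for the next iteration
    best = population[int(np.argmin(losses))]

print(best)  # approaches [2, 1]
```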

Optimizing the Optimizers

Let's discuss some standard techniques that optimize the learning of a Neural Network, whether in terms of learning speed, efficiency, or accuracy.

EPOCHS AND BATCHES

Traditionally, a neural network is trained multiple times on the entire dataset. Each such pass is called an Epoch. The faster the loss decreases, the faster the network learns.

Traditional Training Approach

Alternatively, the dataset can be divided into a set of batches, and the neural network can be trained on each batch, one at a time. In each iteration, the loss is calculated only across the current batch, which makes each update much cheaper than a pass over the full dataset. The state space changes from batch to batch (changing the landscape), and each batch's landscape has different local minima (small valleys, more elevated than the global minimum valley). But as all of the batches are optimized, these differences cancel out, and SGD tends to converge to the global minimum without getting stuck in local minima.

This requires a variant of the optimizer, Mini-Batch Gradient Descent (MBGD). I won't go into the details, but it computes the gradients using only the loss of the current batch.
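Here's a sketch of such a mini-batch loop on the toy problem (the batch size and epoch count are arbitrary):

```python
import numpy as np

xs = np.linspace(-1, 1, 512)
ys = 2 * xs + 1                              # made-up dataset

w, b, lr, batch_size = 0.0, 0.0, 0.1, 32
rng = np.random.default_rng(0)

for epoch in range(20):
    order = rng.permutation(len(xs))         # shuffle the dataset each epoch
    for start in range(0, len(xs), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = xs[idx], ys[idx]            # the current batch
        error = w * xb + b - yb
        # gradients computed from the current batch only
        w -= lr * np.mean(2 * error * xb)
        b -= lr * np.mean(2 * error)

print(w, b)  # approaches 2 and 1
```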

Changing State-Space over Batches

This approach prioritizes convergence over raw speed. It makes it far more likely that a sufficiently large Neural Network converges toward the global minimum instead of getting stuck. Overall it can be slower than the traditional approach, but it is more stable.

Training on Batches of Input

LEARNING RATE

As discussed earlier, SGD computes the gradient and steps in that direction by a discrete amount. The size of this step is called the learning rate.

If we decrease the learning rate over time, the network will, theoretically, be able to fine-tune the details of the function we're trying to approximate. Decay it too fast, and the network will not reach sufficient accuracy in the expected time. Decay it too slowly, and the network will keep overshooting the optimal values. Commonly used decay functions include step decay, exponential decay, polynomial decay, and cosine annealing. Since we're using a larger learning rate at the beginning, the neural network converges faster.

Different Decay Functions
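For example, an exponential decay schedule fits in a one-line function (the initial rate and decay constant below are arbitrary):

```python
import numpy as np

def exponential_decay(step, initial_lr=0.5, decay_rate=0.01):
    # learning rate shrinks smoothly as training progresses
    return initial_lr * np.exp(-decay_rate * step)

for step in (0, 100, 500):
    print(step, exponential_decay(step))   # 0.5, ~0.18, ~0.003
```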

Training with Decaying Learning Rate

STANDARDIZATION

Most neural networks are used for classification, or otherwise have outputs ranging from -1 to 1. Applying the same logic to the inputs improves efficiency, allowing the network to learn at a faster rate. This is because if the inputs are offset or very large, the gradient at those points vanishes; for activations like Sigmoid, the gradient at large values is very low.

Sigmoid Function's Gradient

Other activation functions like ReLU or Leaky ReLU handle negative & larger values better, but it is still preferable to standardize the input. There are multiple methods of data standardization, but that's statistics, so I'll only cover the Z-Score Standardization technique, which applies a linear transformation to the data points so that they have zero mean and unit variance (most values then fall roughly between -1 and 1). A linear transformation means that the relations between the inputs are not lost.

Linear vs Non Linear Transformation
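Here's a minimal sketch of Z-Score standardization in NumPy (the small epsilon guards against a zero standard deviation):

```python
import numpy as np

def z_score(data, eps=1e-8):
    # linear transformation: zero mean, unit variance
    return (data - np.mean(data)) / (np.std(data) + eps)

pixels = np.array([0, 64, 128, 255], dtype=float)   # e.g. raw pixel values
print(z_score(pixels))  # values centred around 0
```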

The gradient-vanishing concept can also be explained with the term saturation, which means the input values are so large that the neuron is always fully activated (losing its useful behaviour) and requires a large change in its input to change its output. Standardization fixes this problem, and that greatly speeds up the learning process.

Training on Standardized Input

How I did it

Recognizing Handwritten Digits

I trained a small neural network to recognize 28x28 images of handwritten digits. It has 2 hidden layers with 64 neurons each.

The training data comes from one of the most famous datasets, the MNIST dataset of 70,000 images; it's sometimes called the "Hello World" of Machine Learning. The dataset is divided into 60,000 images for training and 10,000 for testing (loss analysis). This split is performed randomly to avoid introducing arbitrary bias.

The Neural Network uses SGD with an exponentially decaying learning rate, the Sigmoid activation function for the hidden layers, the Softmax function for the output layer, and Z-Score standardized input.
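Putting the pieces together, here's a rough sketch of what a forward pass through such a network looks like; the layer sizes match the description above, but the weight initialization and other details are illustrative assumptions, not the project's exact code:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

def z_score(data, eps=1e-8):
    return (data - np.mean(data)) / (np.std(data) + eps)

rng = np.random.default_rng(0)
sizes = [784, 64, 64, 10]                 # 28x28 input, two hidden layers of 64, 10 digits
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [np.zeros(m) for m in sizes[1:]]

def forward(image):
    a = z_score(image.reshape(784))                 # standardized input
    a = sigmoid(weights[0] @ a + biases[0])         # hidden layer 1
    a = sigmoid(weights[1] @ a + biases[1])         # hidden layer 2
    return softmax(weights[2] @ a + biases[2])      # probabilities over the 10 digits

print(np.argmax(forward(rng.random((28, 28)))))     # predicted digit (untrained, so random)
```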

Recognizing Alphabets in Different Typefaces

I also trained another neural network to recognize English letters in different typefaces, from images of size 50x50.

The dataset is generated using a Python script. With 48 fonts supplied, a dataset of 70,000+ images (225 MB) is generated, with 2 size variations per font for each of the 27 uppercase & lowercase English letters. The dataset is split 80/20 between training and testing, respectively.

This Neural Network also uses SGD with an exponentially decaying learning rate, the Sigmoid activation function for the hidden layers, the Softmax function for the output layer, and Z-Score standardized input.

Source Code

Source code for both networks, and the dataset generator script, is available on GitHub.

Final Thoughts

Machine Learning is a truly fascinating field. You can train a machine to perform any task. I plan to explore further by approximating 3D shapes with Machine Learning in an upcoming Case Study. Stay connected on my socials for updates.

You now have a solid foundation in Machine Learning, and I hope this article sparked your curiosity to venture out on your own.

To close off the article, I'd like to mention a common saying that machine learning is just a rebrand of statistics. Underneath it all, it's all maths. Always!