I have wanted to start learning neural networks and ML in general for some time now. In the past, I have tried a well-known course called “Practical Deep Learning for Coders” from the folks at fast.ai. It is a great course and this post is based off of what I’ve learned from it so far. One thing I have discovered is that the core moving parts of machine learning and neural networks are a lot less complicated than I first believed.
While it is amazingly productive to use high-level machine learning and neural network libraries alongside a powerful GPU to create powerful ‘AI’ applications, I feel that it is also important to understand how it all actually works from a ‘nuts-and-bolts’ perspective. As a software engineer I like to fully understand at least one layer below the layer I am working on in any particular system. If you cannot conceptualize a system you cannot confidently use it, IMHO.
My goal became to write my own neural network from scratch. No libraries are allowed, but performance does not matter much at all for this project. The goal here is to focus on learning the fundamentals. The approach I took was to find a basic presentation of a complete neural network and then see if I could achieve the same results by writing my own code in a language I am very familiar with.
With this goal in mind, I went back to the fast.ai course mentioned above and I started watching Jeremy Howard’s lectures again. Eventually, I got to a portion of a lecture where he introduces neural networks using an Excel spreadsheet. This was exactly the sort of thing I was looking for. It is reasonable to replicate the functionality of this spreadsheet in code so I am going to do just that.
Titanic Data, not Big Data
This spreadsheet example project uses data from the Titanic disaster. The following information is a generated example of the data I am working with. The full set of real data can be downloaded from Kaggle. It is also available as part of the GitHub repo for this project.
| survived | class | sex | age | par/ch | sib/sp | port | fare |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | M | 19 | 1 | 1 | S | 22.12 |
| 0 | 2 | F | 21 | 2 | 0 | C | 75.00 |
| 0 | 3 | M | 33 | 0 | 1 | Q | 4.99 |
| 1 | 1 | F | 12 | 3 | 2 | S | 1.00 |

Table 1 – generated data
Each row in the table represents information for a single passenger. The columns contain the following information:
- `survived` – Did the passenger survive?
- `class` – What class was the passenger traveling in?
- `sex` – What sex was the passenger?
- `age` – What age was the passenger at the time?
- `par/ch` – How many parents and children of the passenger were aboard?
- `sib/sp` – How many siblings and spouses of the passenger were aboard?
- `port` – What port did the passenger embark from?
- `fare` – How much was the passenger’s fare?
The goal of my model will be to predict the first column (`survived`) using the subsequent data points as inputs.
Normalization, It’s Like Middle School For Data
In order to achieve my goal, I will need to put the data through some changes to make it more conforming and compatible. For example, how do I compare the value of `port` to the `fare` value? One is a string, and the other is a number!
I must normalize the data to have all of the values be a comparable set of numbers. The normalized form of the data in Table 1 above is the following in Table 2.
| class 1 | class 2 | sex | age | par/ch | sib/sp | port S | port C | fare |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 1 | 0.12 | 0.33 | 0.5 | 1 | 0 | 0.48 |
| 0 | 1 | 0 | 0.73 | 0.66 | 0 | 0 | 1 | 1.34 |
| 0 | 0 | 1 | 0.40 | 0 | 0.25 | 0 | 0 | 2.12 |
| 1 | 0 | 0 | 0.22 | 1 | 1 | 1 | 0 | 0.10 |

Table 2 – generated data
Categorical columns with more than two options are separated into multiple boolean columns indicating whether that category applies. Numerical columns like `age` are converted to relative values between zero and one. Finally, the `fare` column represents a monetary value, so I take the `log10()` of the value. The fast.ai courses advise that money is often best represented as a logarithmic value to reflect its real-world impact in a model, so I am doing the same here.
One very important aspect of the normalized data is that the absolute values are all within a similar range. This ensures each data point carries about the same amount of weight in the model as the others. The `survived` column has been removed since that is the value I am attempting to predict. Those values are saved elsewhere in the model for calculating loss and for training.
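To make the two numeric transforms concrete, here is a minimal sketch. These helper functions are my own illustration, not the actual `normalize()` code from the repo, which handles this generically:

```php
<?php
# Illustrative helpers only -- not the repo's normalize() implementation.

# Min-max scaling: map a value into [0, 1] relative to its column's range.
function minMaxScale(float $value, float $min, float $max): float
{
    return ($value - $min) / ($max - $min);
}

# Fares are monetary, so take the base-10 logarithm instead.
function scaleFare(float $fare): float
{
    return log10($fare);
}

echo minMaxScale(19, 12, 33); # age 19 within an observed range of 12-33
echo "\n";
echo scaleFare(22.12);        # fare 22.12 on a log scale
```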
Model Parameters
The data in Table 2 above will be the input data for the model. In the model, each column in the table has a single corresponding parameter. These parameters are the values I will be training. The goal of training is to find the optimal values for each parameter so that when we use them to infer or predict the outcome for any single passenger, the result is equal to (or rounds to) the value of the `survived` column (i.e. it accurately predicts their survival status). Making a prediction is a basic operation of multiplying the row’s values by the corresponding parameters and summing those products. I will cover this process more in the next section. Table 3 below shows each of our parameters, except that we do not know what the values should be.
| class 1 | class 2 | sex | age | par/ch | sib/sp | port S | port C | fare |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ? | ? | ? | ? | ? | ? | ? | ? | ? |

Table 3 – The parameters, optimal values unknown.
To begin training our model parameters we need some value to use as a starting point for each one. The computer can’t do math on question marks! So I simply assign a random number to each parameter to give it a starting point. It is important that the randomly generated values be selected from within the same range as the normalized values from Table 2.
| class 1 | class 2 | sex | age | par/ch | sib/sp | port S | port C | fare |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| -0.4 | 0.01 | 0.09 | -0.81 | 0.75 | 0.43 | 0.23 | -0.64 | 0.57 |

Table 4 – The parameters, initialized with random values I just made up.
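The actual initialization lives in the housekeeping code, but a minimal version might look like this (a sketch under my own assumptions; the repo’s implementation may differ):

```php
<?php
# A sketch of random parameter initialization. Each column gets a value
# drawn from [-1, 1], the same range as the normalized data.
function initParams(array $columns): array
{
    $params = [];
    foreach ($columns as $col) {
        # mt_rand() / mt_getrandmax() yields a float in [0, 1];
        # scale and shift it into [-1, 1].
        $params[$col] = (mt_rand() / mt_getrandmax()) * 2 - 1;
    }
    return $params;
}

print_r(initParams(['class 1', 'class 2', 'sex', 'age', 'fare']));
```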
Making a Prediction
Predictions are made on a per-row basis. As mentioned above making a prediction involves multiplying each value in the row by its corresponding parameter and then summing those products. The code to perform the predictions is straightforward and is as follows:
```php
$prediction = 0.0;
foreach ($this->params as $col => $parameter) {
    $prediction += ($parameter * $row[$col]);
}
echo $prediction;
```
I loop through each parameter in the model and multiply it by the corresponding value from the data row. Those products are then added together to create the final prediction value for the row.
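As a worked example, the same loop can be run standalone against the made-up parameters from Table 4 and the first passenger’s normalized row, keyed by column name:

```php
<?php
# Made-up parameters from Table 4, keyed by column name.
$params = [
    'class 1' => -0.4, 'class 2' => 0.01, 'sex' => 0.09,
    'age' => -0.81, 'par/ch' => 0.75, 'sib/sp' => 0.43,
    'port S' => 0.23, 'port C' => -0.64, 'fare' => 0.57,
];

# The first passenger's normalized values (first class, male, age 19, ...).
$row = [
    'class 1' => 1, 'class 2' => 0, 'sex' => 1,
    'age' => 0.12, 'par/ch' => 0.33, 'sib/sp' => 0.5,
    'port S' => 1, 'port C' => 0, 'fare' => 0.48,
];

$prediction = 0.0;
foreach ($params as $col => $parameter) {
    $prediction += ($parameter * $row[$col]);
}

# The sum comes out near 0.56, which rounds to 1 -- this passenger survived.
echo round($prediction, 4);
```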
Supporting Code
The entire model will fit into a single class which I will later extend to implement neural networking. Let’s start with a new class and some basic properties.
```php
class LearningMachine
{
    protected array $dataframe; # Input data (i.e. Table 2)
    protected array $labelData; # Actual outcomes (the survived column)
    protected array $params;    # Model parameters (i.e. Table 3)
}
```
- `$dataframe` contains the normalized data table (Table 2).
- `$labelData` contains an array with an entry for each row in the dataframe. It holds the `survived` value for each row.
- `$params` contains an array with a single value for each column in the data table. These are the parameters I will be adjusting during the training phase in order to make more accurate predictions. At this stage the values are empty, but they will be initialized during setup.
The following methods are some housekeeping/boilerplate methods that I will not cover in detail for this post. If you’d like to read through them you can always view the source code. The normalization code is fairly bare-bones and really only covers what is needed in this small data set.
```php
public function unpack($p): void { ... }           # Read user options
protected function categorize(): array { ... }     # Normalize categorical cols
public function loadTrainingData(): array { ... }  # Load data from file
public function normalize(): void { ... }          # Normalize data
protected function cast(string $datatype, mixed $value): mixed { ... }
```
Bootstrapping the Model
When I run the model it will perform five main steps: set up the user’s options, load the data, normalize the data, initialize the parameters, and train the parameters.
```php
public function run(array $options): float
{
    $this->unpack($options);             # Hydrate object properties with the passed options.
    $this->loadTrainingData();           # Load the CSV file into memory.
    $this->normalize();                  # Normalize the data.
    $this->params = $this->initParams(); # Initialize model parameters.
    $loss = $this->descendGradient();    # Run the training loop.
    return $loss;
}
```
A brief description of each of these methods follows:
- `unpack()` takes the options passed by the user and assigns them to the correct object properties.
- `loadTrainingData()` opens the CSV file, parses it, and loads the data into memory.
- `normalize()` uses information passed in the options to process the training data into a normalized data set that can be used by the code.
- `initParams()` sets the initial value of each parameter to a random value.
- `descendGradient()` runs the training process and will be the focus of the rest of the article.
Again, for more detail on these methods view the source code on GitHub.
The Training Loop
The parameters have been set to random values in a previous section. In this section, I begin adjusting the values of the parameters in order to get better predictions. The `descendGradient()` method controls this process, which is sometimes called the training loop.
```php
public function descendGradient(): float
{
    $loss = 0.0; # Guard against a zero-iteration run.
    # Run the loop for the specified number of iterations.
    for ($i = 0; $i < $this->iterations; $i++) {
        $loss = $this->train($this->params);
        $this->params = $this->adjustParams($this->params);
    }
    return $loss;
}
```
The number of iterations the loop will perform is passed by the user in the options array. This represents the number of times it will train the parameters. Inside the loop, it performs calculations to train, or adjust, the parameters towards their optimal values.
Making Predictions, Taking Losses
The `train()` method accepts a set of parameters and uses them to make predictions. The average loss of these predictions is calculated and returned.
```php
protected function train(array $params): float
{
    # Make the predictions using the current parameters.
    $predictions = $this->makePredictions(
        $params,
        $this->dataframe
    );
    # Calculate the loss for each prediction.
    $loss = $this->loss($predictions);
    # Return the average loss.
    $lossAvg = array_sum($loss) / count($loss);
    return $lossAvg;
}
```
The calculations in the `makePredictions()` method work like the code described in the Making a Prediction section above. Loss calculations are described below, but once I have them I simply return the average of the loss values.
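`makePredictions()` itself is not reproduced in this post; based on the per-row logic above, it is presumably something like the following (my sketch, written as a free function rather than the repo’s class method):

```php
<?php
# Sketch: run the per-row prediction loop over every row in the dataframe.
function makePredictions(array $params, array $dataframe): array
{
    $predictions = [];
    foreach ($dataframe as $serial => $row) {
        $prediction = 0.0;
        foreach ($params as $col => $parameter) {
            $prediction += ($parameter * $row[$col]);
        }
        $predictions[$serial] = $prediction;
    }
    return $predictions;
}
```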
Calculating Loss
The loss function simply iterates over the predictions and calculates the loss for each one. The loss represents, to the extent it can, the accuracy of the prediction. The lower the loss value, the more accurate the prediction. I am calculating the loss using the Mean Squared Error: the squaring happens per row here, and the mean is taken back in `train()`.
```php
protected function loss(array $predictions): array
{
    # Calculate the loss for each prediction.
    $loss = [];
    foreach ($predictions as $serial => $prediction) {
        $label = $this->labelData[$serial]; # Actual outcome (the survived col)
        $loss[$serial] = ($prediction - $label) ** 2; # Squared error for this row
    }
    return $loss;
}
```
For each of the predictions from the `train()` method I calculate the loss. This is done by subtracting the corresponding `survived` value from the prediction and then squaring the result.
The loss value can be reported back to the user so they can track the loss after each training iteration.
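As a quick sanity check of the arithmetic, with a made-up prediction:

```php
<?php
$prediction = 0.56; # hypothetical model output for a passenger
$label = 1;         # the passenger actually survived

$loss = ($prediction - $label) ** 2; # (-0.44)^2
echo $loss; # 0.1936
```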
Adjusting the Parameters
For each parameter, the partial derivative of the loss function is calculated and multiplied by the user-provided learning rate, and the product is subtracted from the parameter’s current value. This provides a new, trained value for the parameter. I will describe the partial derivative function more below.
```php
protected function adjustParams($params): array
{
    # Copy params so I can adjust them without affecting the original.
    $adjParams = $params;
    foreach ($params as $param => $value) {
        # Calculate the partial derivative of the loss function with
        # respect to the parameter and adjust it using the learning
        # rate as a multiplier.
        $p = $value - $this->learningRate * $this->partialDerivative($param);
        $adjParams[$param] = $p;
    }
    # Return the adjusted parameters.
    return $adjParams;
}
```
Gradient Descent or the Mathematics of Going Downhill
Consider the process I have already described of making a prediction and calculating its loss as a single “function”. I send inputs into the function (data and parameters) and it returns an output (the loss). The goal is to minimize the loss, i.e. the output of the function. So which set of inputs produces the smallest output? That is the $$$ question!
As an example, I could iterate through every possible value of a single parameter in the model while leaving the others the same. Let’s say I run predictions and calculate losses for every value within the model’s range (-1 to 1) at a high-resolution step like +/-0.00001. I could graph each parameter value and its resulting loss on a line in a two-dimensional plane (x, y). This line is a visual representation of the function, and the lowest points would be where the parameter value provides the most accurate predictions. To get the best answer, though, I would need to do this exercise for all parameters, in every possible combination of every value. In other words, it would take a very, very long time.
Gradient descent is a method for estimating the lowest result of the loss function. Instead of calculating every possibility, I can use calculus to tell whether a particular parameter adjustment is pointing in the right direction, i.e. that I am going down the gradient of the loss function.
In order to do this I calculate the derivative of the loss function at the current iteration. The derivative tells me the current, or instantaneous, slope of the loss function. A positive result means I am ascending and a negative result means I am descending.
It is important to point out that I am actually calculating partial derivatives and then combining them. Since the function is multivariate (i.e. it has more than one parameter) I must isolate each parameter so I can calculate one derivative at a time. The exact math of the derivative depends on which loss function is used. The MSE function is common, so I did not have to learn calculus and figure out this math myself. [If I had to bet on where I have made mistakes, this method would be at the top of my list though.]
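For reference, here is the math I believe this method implements. For the MSE loss over n rows, where yᵢ is the actual outcome, ŷᵢ is the prediction for row i, and xᵢⱼ is row i’s value for parameter pⱼ:

loss = (1/n) · Σᵢ (yᵢ − ŷᵢ)²

∂loss/∂pⱼ = (−2/n) · Σᵢ (yᵢ − ŷᵢ) · xᵢⱼ

The (−2/n) factor and the per-row (yᵢ − ŷᵢ) · xᵢⱼ products are exactly the sum the code computes.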
```php
protected function partialDerivative(string $param): float
{
    $results = [];
    foreach ($this->dataframe as $i => $row) {
        $truth = $this->labelData[$i]; # The actual outcome.
        $results[$i] = ($truth - $this->predict($this->params, $row)) * $row[$param];
    }
    $rowCount = count($this->dataframe);
    return (-2 / $rowCount) * array_sum($results);
}
```
A brief overview of this function: for each row, the outcome value (survived = 1, died = 0) is looked up. I then make a prediction from the data in the row using the current parameters. The prediction is subtracted from the outcome value, and that result is multiplied by the row’s value for the parameter I am considering. Finally, I divide negative two by the row count and multiply that by the sum of the previous calculations. The result is the partial derivative, which I return to the `adjustParams()` method reviewed above.
Running the Linear Regression
So far I have implemented machine learning but I have not implemented a neural network. I will get to that, but first let’s run what I have so far.
```bash
php-nn $ php nn.php
Running Linear Regression with 10 parameters...
loss was 0.14422714930732
```
Let’s compare that with the spreadsheet from fast.ai.
The loss values match up! If you run the script and inspect the final parameter values, you will see they line up as well. I have successfully implemented a machine learning algorithm. This is actually most of the work for implementing a neural network, but there is one more layer I need to add for that. I will do so by creating a subclass of the `LearningMachine` class.