tero.co.uk

Programming Code For A Simple Neural Network

Published 1 September 2015

This article is a continuation of our neural networks tutorial. This provides a practical implementation of a neural network in C. Each section of code cross references a section in the tutorial so you can relate the equations to the code.

Bias in a neural network

The previous article presented a three layer neural network, but we left out the bias to keep it simple. But the bias is needed for a real network. For example, imagine you have a friend who is 150cm tall and weighs 75kg. We might guess an equation to relate these two variables: weight = 0.5 * height. But this would imply that when your friend was a newborn baby and only 50cm long, s/he weighed 25kg, which would be an extremely heavy baby. So we can add a number to the equation: weight = 0.6 * height - 25. That equation roughly fits both a newborn baby and an adult. The 0.6 in this equation is like the weight in a neural network. And the number -25 is the bias. You can also think of it like an offset. In algebra it is the point where the line crosses the y-axis.

This bias can be connected into the neural network fairly easily. We can add it as a new fake node to the input and hidden layers. The bias nodes will always have a value of +1 and their connection weights will represent the number like -25 above. Then the network will look like this:

1 1

It will operate pretty much the same as before with a forward pass and backward propagation. The only difference is that the bias nodes will always be set to 1 and will never be updated.

Programming a basic neural network

This article gives the programming code for a basic neural network in C. I chose C because many other languages are based on it (Javascript, PHP, Java, C++, Cuda, OpenCL). It is also fast and (if you already have a C compiler) doesn't require any extra installations.

The objective of this code is to show all the operations of the network in full. If you were really programming this, you would do the matrix operations using a highly optimised linear algebra library. And you would dynamically allocate memory, and pass in options from the command line, and so on.

The code starts with some library includes:

#include <stdio.h> //for printf
#include <stdlib.h> //for rand
#include <string.h> //for memcpy
#include <math.h> //for tanh

Then some compiler definitions for the number of training examples and number of nodes in each layer. These could be set dynamically at run-time, but it's clearer to do them in advance.

#define NUM_TRAINING 3
#define NUM_INPUTS 3
#define NUM_HIDDEN 3
#define NUM_OUTPUTS 3

And C program has a "main" function:

int main (int argc, char **argv) {
	return 0;
}

Save this to a file called neuralnetwork.c. You'll need a C compiler to compile it such as gcc or g++ or Visual Studio. On Mac or Linux, you can compile and run on the command line with:

gcc neuralnetwork.c -o neuralnetwork.out; ./neuralnetwork.out

This first command compiles your file and creates an executable called neuralnetwork.out. The second command runs that file. At this point it won't do anything, but shouldn't give any errors.

The input and output

Within the main function, we'll first create the training examples. These are just two dimensional arrays of our binary representations of the letters. So an input of A=100 should give an output B=010.

	float inputs[NUM_TRAINING][NUM_INPUTS]   = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}}; //the 3 possible inputs ABC
	float outputs[NUM_TRAINING][NUM_OUTPUTS] = {{0, 1, 0}, {0, 0, 1}, {1, 0, 0}}; //the corresponding outputs BCA

Then we'll set a couple of parameters for the neural network. The learning rate specifies how fast it learns, and the number of iterations is how many times it goes through the inputs and outputs above before it has finished learning:

	float learningrate = 0.01;
	int iterations = 10000;

Initialise the weights

In code, the weights can be stored in a two dimensional array (a matrix). The size of the array depends on the number of nodes. For example, the weights between the input and hidden nodes has NUM_INPUTS+1 rows and NUM_HIDDEN columns. The +1 accounts for the bias node:

	float Wxh[NUM_INPUTS+1][NUM_HIDDEN]; //weights between inputs x and hidden nodes, including an extra one for bias
	float Why[NUM_HIDDEN+1][NUM_OUTPUTS]; //weights between hidden nodes and output y, including an extra one for bias

Now we randomly initialise these weights to small values around 0. The C rand function can do this. It gives a random number from 0 to the built-in constant RAND_MAX. We divide by that to get a number from 0 to 1, then reduce that to the range -0.1 to 0.1. As we are not seeding the random number geneator, it may well produce the same "random" numbers every time you run it. In a real network, this would be undesireable (and we'd probably also randomise according to a Gaussian distribution). For testing it's useful:

	int t, i, h, o; //looping variables for iterations, input nodes, hidden nodes, output nodes
	for (i=0; i<NUM_INPUTS+1; i++) for (h=0; h<NUM_HIDDEN; h++) Wxh[i][h] = ((float)rand() / (double)RAND_MAX) * 0.2 - 0.1;
	for (h=0; h<NUM_HIDDEN+1; h++) for (o=0; o<NUM_OUTPUTS; o++) Why[h][o] = ((float)rand() / (double)RAND_MAX) * 0.2 - 0.1;

Then we need to make space for a bunch of other variables used in the training:

	float x[NUM_INPUTS+1]; //input vector including one extra for bias
	float y[NUM_OUTPUTS]; //output vector
	float zhWeightedSums[NUM_HIDDEN]; //weighted sums for the hidden nodes
	float hActivationValues[NUM_HIDDEN+1]; //activation values of the hidden nodes, including one extra for the bias
	float zyWeightedSums[NUM_OUTPUTS]; //weighted sums for the output nodes
	float probabilities[NUM_OUTPUTS]; //activation values of the output nodes
	float outputErrors[NUM_OUTPUTS]; //error in the output
	float deltaWxh[NUM_INPUTS+1][NUM_HIDDEN]; //adjustments to weights between inputs x and hidden nodes
	float deltaWhy[NUM_HIDDEN+1][NUM_OUTPUTS]; //adjustments to weights between hidden nodes and output y
	float loss, sum; //for storing the loss

Training Loop

The code then begins an iterations loop. Each iteration specifies one input/output cycle. This code copies an input and output into the x and y arrays. x has NUM_INPUTS+1 entries for each input node. The first entry x[0] is the bias node and is always set to 1. The C memcpy copies the other 3 input nodes into x. y is for the output nodes, but it doesn't have a bias:

	for (t=0; t<iterations; t++) { //this is how many times we will train the network
		x[0]=1; memcpy (&x[1], inputs[t % NUM_TRAINING], sizeof (inputs[0])); //copy the input into x with the bias=1
		memcpy (y, outputs[t % NUM_TRAINING], sizeof (outputs[0])); //copy the output into y

Weighted sums for hidden nodes

Now the forward pass begins by computing the weighted sums for the hidden nodes. This is the vector product of the input vector x and the weights between the input and hidden nodes. It is computed as:

\(z_{hj} = \sum_{i=0}^{n_h} x_i w^{xh}_{ij}\)

In C code, we first set all the weighted sums to 0, and then loop through the inputs and weights adding up the products. In a real neural network, this would be done by an optimised linear algebra function or library. It's shown here explicitly for clarity:

		memset (zhWeightedSums, 0, sizeof (zhWeightedSums)); //set all the weighted sums to zero
		for (h=0; h<NUM_HIDDEN; h++) for (i=0; i<NUM_INPUTS+1; i++) zhWeightedSums[h] += x[i] * Wxh[i][h]; //multiply and sum inputs * weights

Activation values

Next we compute the activation values using the tanh function. The variable hActivationValues stores the values of the NUM_HIDDEN+1 hidden nodes. The 0th entry is the bias node and is always set to 1. The other entries are the tanh of the weighted sums from above:

\(h_{j+1} = \tanh\ (z_{hj})\)

		hActivationValues[0]=1; //set the bias for the first hidden node to 1
		for (h=0; h<NUM_HIDDEN; h++) hActivationValues[h+1] = tanh (zhWeightedSums[h]); //apply activation function on other hidden nodes

Weighted sums for output nodes

Now we do the weighted sums for the output layer by computing the vector product of the hidden node activation values and the weights between the hidden and output nodes:

\(z_{yj} = \sum_{i=1}^{n_y} h_i w^{hy}_{ij} \)

		memset (zyWeightedSums, 0, sizeof (zyWeightedSums)); //set all the weighted sums to zero
		//multiply and sum inputs * weights
		for (o=0; o<NUM_OUTPUTS; o++) for (h=0; h<NUM_HIDDEN+1; h++) zyWeightedSums[o] += hActivationValues[h] * Why[h][o];

Probable answer

Then we compute the probable answer using the softmax function.

\(p = \text{softmax}\ (z_y)\)

		for (sum=0, o=0; o<NUM_OUTPUTS; o++) {probabilities[o] = exp (zyWeightedSums[o]); sum += probabilities[o];} //compute exp(z) for softmax
		for (o=0; o<NUM_OUTPUTS; o++) probabilities[o] /= sum; //apply softmax by dividing by the the sum all the exps

Output error

The output error is the computed softmax probability minus the expected output y:

\(e_y = p - y\)

		for (o=0; o<NUM_OUTPUTS; o++) outputErrors[o] = probabilities[o] - y[o]; //the error for each output

Loss

The loss is a single number which tells us how well the network is doing. When the network is operating correctly, this loss should reduce after each iteration and eventually reach close to zero. This is a cross-entropy log loss for more than 2 ouputs:

\(J = - \sum_{i=1}^{n_y} y_j \log p_j \)

		for (loss=0, o=0; o<NUM_OUTPUTS; o++) loss -= y[o] * log (probabilities[o]); //the loss

Finding the gradient

This concludes the feed foward stage. And now we have to calculate how to adjust the weights to improve the network. This uses an algorithm called gradient descent which is based on the partial derivatives of the loss function.

\(\frac {\partial J} {\partial w^{hy}_{ij}} = h_i e_j \)

We use the C variable deltaWhy to store the partial derivatives. For the weights between the hidden and output layers, it is computed by multiplying each hidden node activation value times the output errors:

		for (h=0; h<NUM_HIDDEN+1; h++) for (int o=0; o<NUM_OUTPUTS; o++) deltaWhy[h][o] = hActivationValues[h] * outputErrors[o];

Backward propagation

In order to adjust the weights between the input and hidden nodes, we need to backward propagate the error. This involves multiplying the output error by the hidden/output weight matrix:

\(e_h = e_y w^{hy}\)

In the code, we reuse the hActivationValues variable for this. Note that it starts at index 1 because hActivationValues[0] is the bias node and isn't involved in backwards propagation. We first set it to all zeros before doing the sums:

		memset (hActivationValues, 0, sizeof (hActivationValues)); //set all the weighted sums to zero
		for (h=1; h<NUM_HIDDEN+1; h++) for (o=0; o<NUM_OUTPUTS; o++) hActivationValues[h] += Why[h][o] * outputErrors[o];

Then we back prop through the tanh activation function by calculating its gradient and multiplying by the backwards propagated error above:

\(z_{eh} = e_h \odot (1 - \tanh^2 (z_h))\)

 		//apply activation function gradient
		for (h=0; h<NUM_HIDDEN; h++) zhWeightedSums[h] = hActivationValues[h+1] * (1 - pow (tanh (zhWeightedSums[h]), 2));

Finally the changes to the weights between the input and hidden nodes are the input times the back propagated weighted sums:

\(\delta w^{xh} = x^T z_{eh}\)

		//Multiply x*eh*zh to get the adjustments to deltaWxh, this does not include the bias node
		for (int i=0; i<NUM_INPUTS+1; i++) for (int h=0; h<NUM_HIDDEN; h++) deltaWxh[i][h] = x[i] * zhWeightedSums[h];

Changing the weights

We've now computed the gradients and we can use them to change the weights. We multiply by the learning rate and then subtract from the existing weights:

		for (int h=0; h<NUM_HIDDEN+1; h++) for (int o=0; o<NUM_OUTPUTS; o++) Why[h][o] -= learningrate * deltaWhy[h][o];
		for (int i=0; i<NUM_INPUTS+1; i++) for (int h=0; h<NUM_HIDDEN; h++) Wxh[i][h] -= learningrate * deltaWxh[i][h];

Results

And that completes a single iteration! Once we have done this hundreds or thousands of times, the loss settles down to a value close to zero and the network has learned. You can download the full code which also includes a debugging variable and a function for displaying vectors and matrices. If you run it as is, you should see a decreasing loss:

Iteration:          1, loss:    1.99973
Iteration:          2, loss:    1.96918
Iteration:          3, loss:    1.76910
Iteration:          4, loss:    1.99869
Iteration:          5, loss:    1.96801
Iteration:          6, loss:    1.77050
Iteration:          7, loss:    1.99766
Iteration:          8, loss:    1.96684
Iteration:          9, loss:    1.77189
Iteration:         10, loss:    1.99664
Iteration:         11, loss:    1.96568
Iteration:         12, loss:    1.77326
Iteration:         13, loss:    1.99563
Iteration:         14, loss:    1.96452
Iteration:         15, loss:    1.77461
...
Iteration:       9997, loss:    0.01497
Iteration:       9998, loss:    0.01678
Iteration:       9999, loss:    0.02031
Iteration:      10000, loss:    0.01496

Followed by the ending weights. These don't look much like the weights I predicted, but they work:

   input/hidden weights:
  -0.07408   -0.00894    0.21300
  -0.15978   -1.23593   -1.67984
  -1.59281    0.94939    0.94431
   1.76659    0.37032    0.88094
   hidden/output weights:
   0.00746    0.25910   -0.38276
   2.66296   -0.20895   -2.51230
   0.38154   -1.59694    1.08895
   1.15777   -2.25131    1.23477

And then you'll see the predicted output for the last input/output pair, which proves that it is working:

   last input:
   1.00000    0.00000    0.00000 
   last output:
   0.00000    1.00000    0.00000 
   predicted output:
   0.00349    0.99254    0.00397 

For the input A=100, it predicted the correct answer B=010 with over 99% confidence. If you set it to run for a million iterations it will reach 99.99% confidence. You can also increase the learning rate to 0.1 to make it learn more quickly. This is such a simple network that a relatively high learning rate still works.

Running the network

In a real network, this training process could take hours or days rather than seconds, so you would be careful to save the computed weights to a file. When you were ready to run the network, you would just need to load the file and do the feed forward part of the algorithm above to get the predicted probabilities.

Stochastic gradient descent

The type of gradient descent used here is called stochastic gradient descent, where training examples are fed into the network one at a time. For a network like this simple example, you would usually use batch gradient descent, giving all the training examples at once. In batch gradient descent x and y and other variables in the code above would be matrices rather than vectors. Also the loss should decrease every iteration, and not with small jumps in between as above. There are also other approaches such as mini-batch gradient descent which does the descent in small batches.

Conclusion

This article has provided some C code for the network described in the neural networks tutorial.

Programming Code For A Simple Neural Network tero.co.uk

Programming Code For A Simple Neural Network

Published 1 September 2015

This article is a continuation of our neural networks tutorial. This provides a practical implementation of a neural network in C. Each section of code cross references a section in the tutorial so you can relate the equations to the code.

Bias in a neural network

The previous article presented a three layer neural network, but we left out the bias to keep it simple. But the bias is needed for a real network. For example, imagine you have a friend who is 150cm tall and weighs 75kg. We might guess an equation to relate these two variables: weight = 0.5 * height. But this would imply that when your friend was a newborn baby and only 50cm long, s/he weighed 25kg, which would be an extremely heavy baby. So we can add a number to the equation: weight = 0.6 * height - 25. That equation roughly fits both a newborn baby and an adult. The 0.6 in this equation is like the weight in a neural network. And the number -25 is the bias. You can also think of it like an offset. In algebra it is the point where the line crosses the y-axis.

This bias can be connected into the neural network fairly easily. We can add it as a new fake node to the input and hidden layers. The bias nodes will always have a value of +1 and their connection weights will represent the number like -25 above. Then the network will look like this:

1 1

It will operate pretty much the same as before with a forward pass and backward propagation. The only difference is that the bias nodes will always be set to 1 and will never be updated.

Programming a basic neural network

This article gives the programming code for a basic neural network in C. I chose C because many other languages are based on it (Javascript, PHP, Java, C++, Cuda, OpenCL). It is also fast and (if you already have a C compiler) doesn't require any extra installations.

The objective of this code is to show all the operations of the network in full. If you were really programming this, you would do the matrix operations using a highly optimised linear algebra library. And you would dynamically allocate memory, and pass in options from the command line, and so on.

The code starts with some library includes:

#include <stdio.h> //for printf
#include <stdlib.h> //for rand
#include <string.h> //for memcpy
#include <math.h> //for tanh

Then some compiler definitions for the number of training examples and number of nodes in each layer. These could be set dynamically at run-time, but it's clearer to do them in advance.

#define NUM_TRAINING 3
#define NUM_INPUTS 3
#define NUM_HIDDEN 3
#define NUM_OUTPUTS 3

And C program has a "main" function:

int main (int argc, char **argv) {
	return 0;
}

Save this to a file called neuralnetwork.c. You'll need a C compiler to compile it such as gcc or g++ or Visual Studio. On Mac or Linux, you can compile and run on the command line with:

gcc neuralnetwork.c -o neuralnetwork.out; ./neuralnetwork.out

This first command compiles your file and creates an executable called neuralnetwork.out. The second command runs that file. At this point it won't do anything, but shouldn't give any errors.

The input and output

Within the main function, we'll first create the training examples. These are just two dimensional arrays of our binary representations of the letters. So an input of A=100 should give an output B=010.

	float inputs[NUM_TRAINING][NUM_INPUTS]   = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}}; //the 3 possible inputs ABC
	float outputs[NUM_TRAINING][NUM_OUTPUTS] = {{0, 1, 0}, {0, 0, 1}, {1, 0, 0}}; //the corresponding outputs BCA

Then we'll set a couple of parameters for the neural network. The learning rate specifies how fast it learns, and the number of iterations is how many times it goes through the inputs and outputs above before it has finished learning:

	float learningrate = 0.01;
	int iterations = 10000;

Initialise the weights

In code, the weights can be stored in a two dimensional array (a matrix). The size of the array depends on the number of nodes. For example, the weights between the input and hidden nodes has NUM_INPUTS+1 rows and NUM_HIDDEN columns. The +1 accounts for the bias node:

	float Wxh[NUM_INPUTS+1][NUM_HIDDEN]; //weights between inputs x and hidden nodes, including an extra one for bias
	float Why[NUM_HIDDEN+1][NUM_OUTPUTS]; //weights between hidden nodes and output y, including an extra one for bias

Now we randomly initialise these weights to small values around 0. The C rand function can do this. It gives a random number from 0 to the built-in constant RAND_MAX. We divide by that to get a number from 0 to 1, then reduce that to the range -0.1 to 0.1. As we are not seeding the random number geneator, it may well produce the same "random" numbers every time you run it. In a real network, this would be undesireable (and we'd probably also randomise according to a Gaussian distribution). For testing it's useful:

	int t, i, h, o; //looping variables for iterations, input nodes, hidden nodes, output nodes
	for (i=0; i<NUM_INPUTS+1; i++) for (h=0; h<NUM_HIDDEN; h++) Wxh[i][h] = ((float)rand() / (double)RAND_MAX) * 0.2 - 0.1;
	for (h=0; h<NUM_HIDDEN+1; h++) for (o=0; o<NUM_OUTPUTS; o++) Why[h][o] = ((float)rand() / (double)RAND_MAX) * 0.2 - 0.1;

Then we need to make space for a bunch of other variables used in the training:

	float x[NUM_INPUTS+1]; //input vector including one extra for bias
	float y[NUM_OUTPUTS]; //output vector
	float zhWeightedSums[NUM_HIDDEN]; //weighted sums for the hidden nodes
	float hActivationValues[NUM_HIDDEN+1]; //activation values of the hidden nodes, including one extra for the bias
	float zyWeightedSums[NUM_OUTPUTS]; //weighted sums for the output nodes
	float probabilities[NUM_OUTPUTS]; //activation values of the output nodes
	float outputErrors[NUM_OUTPUTS]; //error in the output
	float deltaWxh[NUM_INPUTS+1][NUM_HIDDEN]; //adjustments to weights between inputs x and hidden nodes
	float deltaWhy[NUM_HIDDEN+1][NUM_OUTPUTS]; //adjustments to weights between hidden nodes and output y
	float loss, sum; //for storing the loss

Training Loop

The code then begins an iterations loop. Each iteration specifies one input/output cycle. This code copies an input and output into the x and y arrays. x has NUM_INPUTS+1 entries for each input node. The first entry x[0] is the bias node and is always set to 1. The C memcpy copies the other 3 input nodes into x. y is for the output nodes, but it doesn't have a bias:

	for (t=0; t<iterations; t++) { //this is how many times we will train the network
		x[0]=1; memcpy (&x[1], inputs[t % NUM_TRAINING], sizeof (inputs[0])); //copy the input into x with the bias=1
		memcpy (y, outputs[t % NUM_TRAINING], sizeof (outputs[0])); //copy the output into y

Weighted sums for hidden nodes

Now the forward pass begins by computing the weighted sums for the hidden nodes. This is the vector product of the input vector x and the weights between the input and hidden nodes. It is computed as:

\(z_{hj} = \sum_{i=0}^{n_h} x_i w^{xh}_{ij}\)

In C code, we first set all the weighted sums to 0, and then loop through the inputs and weights adding up the products. In a real neural network, this would be done by an optimised linear algebra function or library. It's shown here explicitly for clarity:

		memset (zhWeightedSums, 0, sizeof (zhWeightedSums)); //set all the weighted sums to zero
		for (h=0; h<NUM_HIDDEN; h++) for (i=0; i<NUM_INPUTS+1; i++) zhWeightedSums[h] += x[i] * Wxh[i][h]; //multiply and sum inputs * weights

Activation values

Next we compute the activation values using the tanh function. The variable hActivationValues stores the values of the NUM_HIDDEN+1 hidden nodes. The 0th entry is the bias node and is always set to 1. The other entries are the tanh of the weighted sums from above:

\(h_{j+1} = \tanh\ (z_{hj})\)

		hActivationValues[0]=1; //set the bias for the first hidden node to 1
		for (h=0; h<NUM_HIDDEN; h++) hActivationValues[h+1] = tanh (zhWeightedSums[h]); //apply activation function on other hidden nodes

Weighted sums for output nodes

Now we do the weighted sums for the output layer by computing the vector product of the hidden node activation values and the weights between the hidden and output nodes:

\(z_{yj} = \sum_{i=1}^{n_y} h_i w^{hy}_{ij} \)

		memset (zyWeightedSums, 0, sizeof (zyWeightedSums)); //set all the weighted sums to zero
		//multiply and sum inputs * weights
		for (o=0; o<NUM_OUTPUTS; o++) for (h=0; h<NUM_HIDDEN+1; h++) zyWeightedSums[o] += hActivationValues[h] * Why[h][o];

Probable answer

Then we compute the probable answer using the softmax function.

\(p = \text{softmax}\ (z_y)\)

		for (sum=0, o=0; o<NUM_OUTPUTS; o++) {probabilities[o] = exp (zyWeightedSums[o]); sum += probabilities[o];} //compute exp(z) for softmax
		for (o=0; o<NUM_OUTPUTS; o++) probabilities[o] /= sum; //apply softmax by dividing by the the sum all the exps

Output error

The output error is the computed softmax probability minus the expected output y:

\(e_y = p - y\)

		for (o=0; o<NUM_OUTPUTS; o++) outputErrors[o] = probabilities[o] - y[o]; //the error for each output

Loss

The loss is a single number which tells us how well the network is doing. When the network is operating correctly, this loss should reduce after each iteration and eventually reach close to zero. This is a cross-entropy log loss for more than 2 ouputs:

\(J = - \sum_{i=1}^{n_y} y_j \log p_j \)

		for (loss=0, o=0; o<NUM_OUTPUTS; o++) loss -= y[o] * log (probabilities[o]); //the loss

Finding the gradient

This concludes the feed foward stage. And now we have to calculate how to adjust the weights to improve the network. This uses an algorithm called gradient descent which is based on the partial derivatives of the loss function.

\(\frac {\partial J} {\partial w^{hy}_{ij}} = h_i e_j \)

We use the C variable deltaWhy to store the partial derivatives. For the weights between the hidden and output layers, it is computed by multiplying each hidden node activation value times the output errors:

		for (h=0; h<NUM_HIDDEN+1; h++) for (int o=0; o<NUM_OUTPUTS; o++) deltaWhy[h][o] = hActivationValues[h] * outputErrors[o];

Backward propagation

In order to adjust the weights between the input and hidden nodes, we need to backward propagate the error. This involves multiplying the output error by the hidden/output weight matrix:

\(e_h = e_y w^{hy}\)

In the code, we reuse the hActivationValues variable for this. Note that it starts at index 1 because hActivationValues[0] is the bias node and isn't involved in backwards propagation. We first set it to all zeros before doing the sums:

		memset (hActivationValues, 0, sizeof (hActivationValues)); //set all the weighted sums to zero
		for (h=1; h<NUM_HIDDEN+1; h++) for (o=0; o<NUM_OUTPUTS; o++) hActivationValues[h] += Why[h][o] * outputErrors[o];

Then we back prop through the tanh activation function by calculating its gradient and multiplying by the backwards propagated error above:

\(z_{eh} = e_h \odot (1 - \tanh^2 (z_h))\)

 		//apply activation function gradient
		for (h=0; h<NUM_HIDDEN; h++) zhWeightedSums[h] = hActivationValues[h+1] * (1 - pow (tanh (zhWeightedSums[h]), 2));

Finally the changes to the weights between the input and hidden nodes are the input times the back propagated weighted sums:

\(\delta w^{xh} = x^T z_{eh}\)

		//Multiply x*eh*zh to get the adjustments to deltaWxh, this does not include the bias node
		for (int i=0; i<NUM_INPUTS+1; i++) for (int h=0; h<NUM_HIDDEN; h++) deltaWxh[i][h] = x[i] * zhWeightedSums[h];

Changing the weights

We've now computed the gradients and we can use them to change the weights. We multiply by the learning rate and then subtract from the existing weights:

		for (int h=0; h<NUM_HIDDEN+1; h++) for (int o=0; o<NUM_OUTPUTS; o++) Why[h][o] -= learningrate * deltaWhy[h][o];
		for (int i=0; i<NUM_INPUTS+1; i++) for (int h=0; h<NUM_HIDDEN; h++) Wxh[i][h] -= learningrate * deltaWxh[i][h];

Results

And that completes a single iteration! Once we have done this hundreds or thousands of times, the loss settles down to a value close to zero and the network has learned. You can download the full code which also includes a debugging variable and a function for displaying vectors and matrices. If you run it as is, you should see a decreasing loss:

Iteration:          1, loss:    1.99973
Iteration:          2, loss:    1.96918
Iteration:          3, loss:    1.76910
Iteration:          4, loss:    1.99869
Iteration:          5, loss:    1.96801
Iteration:          6, loss:    1.77050
Iteration:          7, loss:    1.99766
Iteration:          8, loss:    1.96684
Iteration:          9, loss:    1.77189
Iteration:         10, loss:    1.99664
Iteration:         11, loss:    1.96568
Iteration:         12, loss:    1.77326
Iteration:         13, loss:    1.99563
Iteration:         14, loss:    1.96452
Iteration:         15, loss:    1.77461
...
Iteration:       9997, loss:    0.01497
Iteration:       9998, loss:    0.01678
Iteration:       9999, loss:    0.02031
Iteration:      10000, loss:    0.01496

Followed by the ending weights. These don't look much like the weights I predicted, but they work:

   input/hidden weights:
  -0.07408   -0.00894    0.21300
  -0.15978   -1.23593   -1.67984
  -1.59281    0.94939    0.94431
   1.76659    0.37032    0.88094
   hidden/output weights:
   0.00746    0.25910   -0.38276
   2.66296   -0.20895   -2.51230
   0.38154   -1.59694    1.08895
   1.15777   -2.25131    1.23477

And then you'll see the predicted output for the last input/output pair, which proves that it is working:

   last input:
   1.00000    0.00000    0.00000 
   last output:
   0.00000    1.00000    0.00000 
   predicted output:
   0.00349    0.99254    0.00397 

For the input A=100, it predicted the correct answer B=010 with over 99% confidence. If you set it to run for a million iterations it will reach 99.99% confidence. You can also increase the learning rate to 0.1 to make it learn more quickly. This is such a simple network that a relatively high learning rate still works.

Running the network

In a real network, this training process could take hours or days rather than seconds, so you would be careful to save the computed weights to a file. When you were ready to run the network, you would just need to load the file and do the feed forward part of the algorithm above to get the predicted probabilities.

Stochastic gradient descent

The type of gradient descent used here is called stochastic gradient descent, where training examples are fed into the network one at a time. For a network like this simple example, you would usually use batch gradient descent, giving all the training examples at once. In batch gradient descent x and y and other variables in the code above would be matrices rather than vectors. Also the loss should decrease every iteration, and not with small jumps in between as above. There are also other approaches such as mini-batch gradient descent which does the descent in small batches.

Conclusion

This article has provided some C code for the network described in the neural networks tutorial.