This tutorial continues from the previous neural network tutorial, the Perceptron tutorial. We will now introduce the structure of the multi-layer perceptron, without doubt the most popular neural network structure to date, and the back-propagation algorithm used to train it. If you are in a hurry and just want to mess with the code you can get it from here, but I would recommend reading on to see how the network functions.

#### Tutorial Prerequisites

- The reader should be familiar with the perceptron neural network
- The reader should have a basic understanding of C/C++
- The reader should know how to compile and run a program from the command line in Windows or Linux

#### Tutorial goals

- The reader will understand the structure of the Multi-layer Perceptron neural network
- The reader will understand the back-propagation algorithm
- The reader will know about the wide array of applications this network is used in
- The reader will learn all the above via an actual practical application in optical character recognition

## Tutorial Body

This network was introduced around 1986 with the advent of the back-propagation algorithm. Until then there was no rule via which we could train neural networks with more than one layer. As the name implies, a multi-layer perceptron is just that: a network composed of many neurons, divided into layers. These layers are as follows:

- The **input layer**, where the input of the network goes. The number of neurons here depends on the number of inputs we want our network to get.
- One or more **hidden layers**. These layers come between the input and the output, and their number can vary. The function of the hidden layers is to encode the input and map it to the output. It has been proven that a multi-layer perceptron with only one hidden layer can, given enough hidden neurons, approximate any continuous function that connects its inputs with its outputs.
- The **output layer**, where the outcome of the network can be seen. The number of neurons here depends on the problem we want the neural net to learn.

The multi-layer perceptron differs from the simple perceptron in many ways. What stays the same is the weight randomization: all weights are given random values within a certain range, usually [-0.5,0.5]. That aside, though, for each pattern that is fed to the network, three passes over the net are made. Let’s see them one by one in detail.
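To make the weight randomization concrete, here is a minimal sketch that is separate from the tutorial's source code. The names `weightCount` and `randomWeights` are made up for illustration, the sketch assumes that every hidden layer has the same number of neurons, and it uses the C++11 `<random>` header rather than `rand()`.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Number of connection weights in a fully connected network with
// hiddenL hidden layers of hiddenN neurons each (hypothetical helper).
std::size_t weightCount(std::size_t inputN, std::size_t hiddenN,
                        std::size_t hiddenL, std::size_t outputN)
{
    assert(hiddenL >= 1);
    return inputN * hiddenN                    // input -> first hidden layer
         + hiddenN * hiddenN * (hiddenL - 1)   // hidden -> hidden
         + hiddenN * outputN;                  // last hidden -> output
}

// Returns `count` weights drawn uniformly between -0.5 and 0.5.
std::vector<float> randomWeights(std::size_t count, unsigned seed)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(-0.5f, 0.5f);
    std::vector<float> weights(count);
    for (float& w : weights)
        w = dist(gen);
    return weights;
}
```

For an 8×8 image (64 inputs), two hidden layers of 8 neurons and 10 outputs, `weightCount(64, 8, 2, 10)` gives 656 weights.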

### Calculating the output

In this phase we calculate the output of the network. For each layer, we calculate the firing value of each neuron by summing the products of the values of all the neurons of the previous layer that are connected to said neuron and their corresponding weights. That was a bit of a mouthful, so here it is in pseudocode:

```
for(int i = 0; i < previousLayerNeurons; i ++)
    value[neuron,layer] += weight(i,neuron) * value[i,layer-1];

value[neuron,layer] = activationFunction(value[neuron,layer]);
```

As can be seen from the pseudocode, here too we have activation functions. They are used to normalize the output of each neuron, and the functions most commonly used in the perceptron apply here too. So we gradually propagate forward in the network until we reach the output layer and produce some output values. Just like in the perceptron, these values are initially completely random and have nothing to do with our goal values. But it is here that the back-propagation learning algorithm kicks in.
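The forward pass of a single layer can be sketched in plain C++ as follows. This is only an illustration and not the tutorial's actual code: the function `forwardLayer` and the flat weight layout `weights[i * neurons + j]` are assumptions, and the sigmoid is used as the activation function.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic sigmoid activation function.
float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Computes the values of one layer from the values of the previous layer.
// Assumed layout: weights[i * neurons + j] connects neuron i of the
// previous layer to neuron j of the current layer.
std::vector<float> forwardLayer(const std::vector<float>& prev,
                                const std::vector<float>& weights,
                                std::size_t neurons)
{
    assert(weights.size() == prev.size() * neurons);
    std::vector<float> out(neurons, 0.0f);
    for (std::size_t j = 0; j < neurons; ++j) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < prev.size(); ++i)
            sum += weights[i * neurons + j] * prev[i];
        out[j] = sigmoid(sum);  // normalize the weighted sum into (0, 1)
    }
    return out;
}
```

Calling `forwardLayer` once per layer, feeding each result into the next call, propagates an input pattern all the way to the output layer.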

### Back propagation

The back-propagation learning algorithm uses the delta rule. It computes the deltas (local gradients) of each neuron, starting from the output neurons and going backwards until it reaches the input layer. To compute the deltas of the output neurons, though, we first have to get the error of each output neuron. That’s pretty simple: the multi-layer perceptron is trained in a supervised fashion, so the error is the difference between the desired output and the network’s output.

$$e_j(n) = d_j(n) - o_j(n)$$

where e(n) is the error vector, d(n) is the desired output vector and o(n) is the actual output vector. Now to compute the deltas:

$$\delta_j^{(L)}(n) = e_j^{(L)}(n)\, f'\big(u_j^{(L)}(n)\big)$$

for neuron j in the output layer L, where f'(u_j^{(L)}(n)) is the derivative of the activation function, evaluated at the weighted input sum u of the jth neuron of layer L.

$$\delta_j^{(l)}(n) = f'\big(u_j^{(l)}(n)\big) \sum_k \delta_k^{(l+1)}(n)\, w_{kj}^{(l+1)}(n)$$

for neuron j in hidden layer l, where f'(u_j^{(l)}(n)) is the same derivative for the jth neuron in layer l, and the sum runs over the neurons k of the next layer: each of their deltas is multiplied by the weight that connects neuron j to neuron k.

This is a very important part of the delta rule and the whole essence of back-propagation. Why, you might ask? Because, as high school math teaches us, a derivative is how much a function changes as its input changes. By propagating the derivatives backwards, we are informing all the neurons in the previous layers of the change that is needed in our weights to match the desired output. And all that starts from the initial error calculation at the output layer. Just like magic!
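As a rough sketch of the two delta equations, the functions below compute the deltas of an output layer and of one hidden layer. The names and the flat weight layout are assumptions made for this example. The sketch also assumes a sigmoid activation, whose derivative can be written in terms of the neuron's output y as y(1 - y).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Derivative of the sigmoid, expressed through the neuron output y = f(u).
float sigmoidDerivativeFromOutput(float y) { return y * (1.0f - y); }

// Output layer: delta_j = (d_j - o_j) * f'(u_j).
std::vector<float> outputDeltas(const std::vector<float>& desired,
                                const std::vector<float>& actual)
{
    assert(desired.size() == actual.size());
    std::vector<float> deltas(actual.size());
    for (std::size_t j = 0; j < actual.size(); ++j)
        deltas[j] = (desired[j] - actual[j]) * sigmoidDerivativeFromOutput(actual[j]);
    return deltas;
}

// Hidden layer: delta_j = f'(u_j) * sum_k(delta_k * w_kj).
// Assumed layout: nextWeights[j * nextDeltas.size() + k] connects hidden
// neuron j to neuron k of the next layer.
std::vector<float> hiddenDeltas(const std::vector<float>& hiddenOutputs,
                                const std::vector<float>& nextDeltas,
                                const std::vector<float>& nextWeights)
{
    const std::size_t nextN = nextDeltas.size();
    assert(nextWeights.size() == hiddenOutputs.size() * nextN);
    std::vector<float> deltas(hiddenOutputs.size());
    for (std::size_t j = 0; j < hiddenOutputs.size(); ++j) {
        float sum = 0.0f;
        for (std::size_t k = 0; k < nextN; ++k)
            sum += nextDeltas[k] * nextWeights[j * nextN + k];
        deltas[j] = sum * sigmoidDerivativeFromOutput(hiddenOutputs[j]);
    }
    return deltas;
}
```

With desired outputs {1, 0} and actual outputs {0.5, 0.5}, the output deltas come out as 0.125 and -0.125: the first neuron is told to fire more, the second to fire less.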

### Weight adjustment

Having calculated the deltas for all the neurons we are now ready for the third and final pass of the network, this time to adjust the weights according to the generalized delta rule:

$$w_{ji}^{(l)}(n+1) = w_{ji}^{(l)}(n) + \alpha \left[ w_{ji}^{(l)}(n) - w_{ji}^{(l)}(n-1) \right] + \eta\, \delta_j^{(l)}(n)\, y_i^{(l-1)}(n)$$

Do not be discouraged by all the mathematical mumbo jumbo. It is actually quite simple. What the above says is:

The new weights for layer l are calculated by adding two things to the current weights. The first is the difference between the current weights and the previous weights, multiplied by the coefficient α. This coefficient is called the momentum coefficient and, true to its name, it speeds up the training of a multi-layer perceptron by adding part of the previous weight change to the current one. This is a double-edged sword though: if your momentum coefficient is too large, the weight updates can overshoot, so the network may fail to converge or may get stuck in a local minimum.

The other thing added to the weights is the delta of the neuron in the layer whose weights we change (l), multiplied by the output of the neuron of the previous layer (l-1) that the weight connects to, and all that multiplied by the constant η, which we know as the teaching step from the previous tutorial about the perceptron. And that is basically it! That’s what the multi-layer perceptron is all about. It is no doubt a very powerful neural network and a very powerful tool in statistical analysis.
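For a single weight, the update rule can be sketched as below. The `Weight` struct and the `updateWeight` function are hypothetical names for this example. A real implementation would more likely keep two whole weight vectors, but the arithmetic is the same.

```cpp
// A weight together with its value from the previous iteration,
// which the momentum term needs (hypothetical helper type).
struct Weight {
    float current;
    float previous;
};

// Applies w(n+1) = w(n) + alpha * (w(n) - w(n-1)) + eta * delta * input,
// where `delta` belongs to the neuron the weight leads to and `input`
// is the output of the neuron the weight comes from.
void updateWeight(Weight& w, float delta, float input, float eta, float alpha)
{
    const float next = w.current
                     + alpha * (w.current - w.previous)  // momentum term
                     + eta * delta * input;              // delta rule term
    w.previous = w.current;
    w.current = next;
}
```

Starting from a weight of 0.5 whose previous value was 0.25, with a delta of 0.5, an input of 1.0, a teaching step of 0.5 and a momentum of 0.5, the new weight is 0.5 + 0.125 + 0.25 = 0.875.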

#### Practical Example

It would not be a tutorial if we just explained how it works and gave you the equations. As was already mentioned, the multi-layer perceptron has many applications: statistical analysis, pattern recognition and optical character recognition are just some of them. Our example will focus on a simple instance of optical character recognition. Specifically, the final program will use an MLP to differentiate between a number of **monochrome** .bmp bitmap files and tell us which number each image depicts. I used a resolution of 8×8 pixels for the images, but the reader is free to use other resolutions for their own monochrome images, since the program reads the size from the bitmap itself. Below you can see an example of such bitmaps.

They are ugly, right? Differentiating between them should be hard for a computer, shouldn’t it? This ugliness can be considered noise, and MLPs are really good at separating noise from the actual data that helps them reach a conclusion. But let’s go on and see some code to understand how it is done.

```cpp
class MLP
{
private:
    std::vector<float> inputNeurons;
    std::vector<float> hiddenNeurons;
    std::vector<float> outputNeurons;
    std::vector<float> weights;
    FileReader* reader;
    int inputN,outputN,hiddenN,hiddenL;
public:
    MLP(int hiddenL,int hiddenN);
    ~MLP();

    //assigns values to the input neurons
    bool populateInput(int fileNum);
    //calculates the whole network, from input to output
    void calculateNetwork();
    //trains the network according to our parameters
    bool trainNetwork(float teachingStep,float lmse,float momentum,int trainingFiles);
    //recalls the network for a given bitmap file
    void recallNetwork(int fileNum);
};
```

The above is our multi-layer perceptron class. As you can see it has vectors for all the neurons and their connection weights. It also contains a FileReader object. As we will see below this FileReader is a class we will make to read the bitmap files to populate our input. The functions the MLP has are similar to the perceptron. It populates its input by reading the bitmap images, calculates an output for the network and trains the network. Moreover you can recall the network for a given ‘fileNum’ image to see what number the network thinks the image represents.

```cpp
//Multi-layer perceptron constructor
MLP::MLP(int hL,int hN)
{
    outputN = 10; //the ten digits, 0 to 9
    hiddenL = hL;
    hiddenN = hN;

    //initialize the filereader
    reader = new FileReader();
    //read the first image to see what kind of input will our net have
    inputN = reader->getBitmapDimensions();
    if(inputN == -1)
    {
        printf("There was an error detecting img0.bmp\n");
        return ;
    }

    //let's allocate the memory for the weights
    weights.reserve(inputN*hiddenN+(hiddenN*hiddenN*(hiddenL-1))+hiddenN*outputN);

    //also let's set the size for the neurons vector
    inputNeurons.resize(inputN);
    hiddenNeurons.resize(hiddenN*hiddenL);
    outputNeurons.resize(outputN);

    //randomize weights for inputs to 1st hidden layer
    for(int i = 0; i < inputN*hiddenN; i++)
    {
        weights.push_back( (( (float)rand() / ((float)(RAND_MAX)+(float)(1)) )) - 0.5 );//[-0.5,0.5]
    }

    //if there are more than 1 hidden layers, randomize their weights
    for(int i=1; i < hiddenL; i++)
    {
        for(int j = 0; j < hiddenN*hiddenN; j++)
        {
            weights.push_back( (( (float)rand() / ((float)(RAND_MAX)+(float)(1)) )) - 0.5 );//[-0.5,0.5]
        }
    }

    //and finally randomize the weights for the output layer
    for(int i = 0; i < hiddenN*outputN; i ++)
    {
        weights.push_back( (( (float)rand() / ((float)(RAND_MAX)+(float)(1)) )) - 0.5 );//[-0.5,0.5]
    }
}
```

The network takes the number of hidden neurons and hidden layers as parameters so that it knows how to initialize its neuron and weight vectors. Moreover, it reads the first bitmap, ‘img0.bmp’, to get the dimensions that all the images will have, as can be seen from this line:

```cpp
inputN = reader->getBitmapDimensions();
```

That is a requirement of our tutorial’s program. You are free to use any bitmap size you want for the first image, ‘img0.bmp’, but all the following images must be of the same size. As in most neural networks, the weights are initialized in the range [-0.5,0.5].

```cpp
void MLP::calculateNetwork()
{
    //let's propagate towards the hidden layer
    for(int hidden = 0; hidden < hiddenN; hidden++)
    {
        hiddenAt(1,hidden) = 0;
        for(int input = 0 ; input < inputN; input ++)
        {
            hiddenAt(1,hidden) += inputNeurons.at(input)*inputToHidden(input,hidden);
        }
        //and finally pass it through the activation function
        hiddenAt(1,hidden) = sigmoid(hiddenAt(1,hidden));
    }

    //now if we got more than one hidden layers
    for(int i = 2; i <= hiddenL; i ++)
    {
        //for each one of these extra layers calculate their values
        for(int j = 0; j < hiddenN; j++)//to
        {
            hiddenAt(i,j) = 0;
            for(int k = 0; k < hiddenN; k++)//from
            {
                hiddenAt(i,j) += hiddenAt(i-1,k)*hiddenToHidden(i,k,j);
            }
            //and finally pass it through the activation function
            hiddenAt(i,j) = sigmoid(hiddenAt(i,j));
        }
    }

    int i;
    //and now hidden to output
    for(i =0; i < outputN; i ++)
    {
        outputNeurons.at(i) = 0;
        for(int j = 0; j < hiddenN; j++)
        {
            outputNeurons.at(i) += hiddenAt(hiddenL,j) * hiddenToOutput(j,i);
        }
        //and finally pass it through the activation function
        outputNeurons.at(i) = sigmoid( outputNeurons.at(i) );
    }
}
```

The calculateNetwork function finds the output of the network that corresponds to the currently given input. It propagates the input signals through each layer until they reach the output layer. There is nothing really special about the above code; it is just an implementation of the equations that were presented earlier. As we saw in the constructor, the neural network of our tutorial has 10 different outputs. Each of these outputs represents how strongly the network believes that the input pattern is a certain number. So output 1 being close to 1.0 would mean that the input pattern is most likely the number 1, and so on.
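Turning the ten outputs into a single answer amounts to picking the output with the largest value. Here is a small sketch of that idea. The function name `recognizedDigit` is made up for this example and is not part of the tutorial's source code.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Returns the index of the largest output value, which is the number the
// network considers the best match for the current input pattern.
int recognizedDigit(const std::vector<float>& outputs)
{
    assert(!outputs.empty());
    return static_cast<int>(std::distance(
        outputs.begin(), std::max_element(outputs.begin(), outputs.end())));
}
```

For outputs such as {0.1, 0.9, 0.2}, the function returns 1.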

The training function is too big to post in full here, so I recommend you take a look at the .zip with the source code to see all of it. We will just focus on the implementation of the back-propagation algorithm.

```cpp
for(int i = 0; i < outputN; i ++)
{
    //let's get the delta of the output layer
    //and the accumulated error
    if(i != target)
    {
        outputDeltaAt(i) = (0.0 - outputNeurons[i])*dersigmoid(outputNeurons[i]);
        error += (0.0 - outputNeurons[i])*(0.0-outputNeurons[i]);
    }
    else
    {
        outputDeltaAt(i) = (1.0 - outputNeurons[i])*dersigmoid(outputNeurons[i]);
        error += (1.0 - outputNeurons[i])*(1.0-outputNeurons[i]);
    }
}

//we start propagating backwards now, to get the error of each neuron
//in every layer

//let's get the delta of the last hidden layer first
for(int i = 0; i < hiddenN; i++)
{
    hiddenDeltaAt(hiddenL,i) = 0;//zero the values from the previous iteration
    //add to the delta for each connection with an output neuron
    for(int j = 0; j < outputN; j ++)
    {
        hiddenDeltaAt(hiddenL,i) += outputDeltaAt(j) * hiddenToOutput(i,j) ;
    }
    //The derivative here is only because of the
    //delta rule weight adjustment about to follow
    hiddenDeltaAt(hiddenL,i) *= dersigmoid(hiddenAt(hiddenL,i));
}

//now for each additional hidden layer, provided they exist
for(int i = hiddenL-1; i >0; i--)
{
    //add to each neuron's hidden delta
    for(int j = 0; j < hiddenN; j ++)//from
    {
        hiddenDeltaAt(i,j) = 0;//zero the values from the previous iteration
        for(int k = 0; k < hiddenN; k++)//to
        {
            //the previous hidden layers delta multiplied by the weights
            //for each neuron
            hiddenDeltaAt(i,j) += hiddenDeltaAt(i+1,k) * hiddenToHidden(i+1,j,k);
        }
        //The derivative here is only because of the
        //delta rule weight adjustment about to follow
        hiddenDeltaAt(i,j) *= dersigmoid(hiddenAt(i,j));
    }
}
```

As you can see, this is the second pass over the network, the so-called back-propagation we presented earlier, since we are going backwards this time. Having calculated the output and knowing the desired output (called target in the above code), we start the delta calculation according to the equations that we saw at the start of the tutorial. If you don’t like math, then here it is for you in code. As you can see, many helper macros are used to differentiate between the weights of the different layers and between the deltas.
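The helper macros themselves are defined in the source code and are not reproduced here. To give an idea of what they have to do, the sketch below shows one way the same names could map onto positions in the flat `weights` and `hiddenNeurons` vectors. The `Layout` struct and the exact ordering are assumptions based on the order in which the constructor fills the weights vector, so the real macros may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical index helpers for a network stored in flat vectors.
struct Layout {
    std::size_t inputN, hiddenN, hiddenL, outputN;

    // Position of neuron `neuron` of hidden layer `layer` (layers count from 1).
    std::size_t hiddenAt(std::size_t layer, std::size_t neuron) const {
        assert(layer >= 1 && layer <= hiddenL && neuron < hiddenN);
        return (layer - 1) * hiddenN + neuron;
    }
    // Weight from input neuron `input` to neuron `hidden` of hidden layer 1.
    std::size_t inputToHidden(std::size_t input, std::size_t hidden) const {
        assert(input < inputN && hidden < hiddenN);
        return input * hiddenN + hidden;
    }
    // Weight from neuron `from` of layer toLayer-1 to neuron `to` of layer toLayer.
    std::size_t hiddenToHidden(std::size_t toLayer, std::size_t from, std::size_t to) const {
        assert(toLayer >= 2 && toLayer <= hiddenL && from < hiddenN && to < hiddenN);
        return inputN * hiddenN + (toLayer - 2) * hiddenN * hiddenN + from * hiddenN + to;
    }
    // Weight from neuron `hidden` of the last hidden layer to output neuron `output`.
    std::size_t hiddenToOutput(std::size_t hidden, std::size_t output) const {
        assert(hidden < hiddenN && output < outputN);
        return inputN * hiddenN + (hiddenL - 1) * hiddenN * hiddenN
             + hidden * outputN + output;
    }
};
```

With this layout, `weights[layout.hiddenToOutput(j, i)]` would play the role of `hiddenToOutput(j,i)` in the listings.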

```cpp
//Weights modification
tempWeights = weights;//keep the previous weights somewhere, we will need them

//input to hidden weights
for(int i = 0; i < inputN; i ++)
{
    for(int j = 0; j < hiddenN; j ++)
    {
        inputToHidden(i,j) += momentum*(inputToHidden(i,j) - _prev_inputToHidden(i,j)) +
                              teachingStep* hiddenDeltaAt(1,j) * inputNeurons[i];
    }
}

//hidden to hidden weights, provided more than 1 layer exists
for(int i = 2; i <=hiddenL; i++)
{
    for(int j = 0; j < hiddenN; j ++)//from
    {
        for(int k =0; k < hiddenN; k ++)//to
        {
            hiddenToHidden(i,j,k) += momentum*(hiddenToHidden(i,j,k) - _prev_hiddenToHidden(i,j,k)) +
                                     teachingStep * hiddenDeltaAt(i,k) * hiddenAt(i-1,j);
        }
    }
}

//last hidden layer to output weights
for(int i = 0; i < outputN; i++)
{
    for(int j = 0; j < hiddenN; j ++)
    {
        hiddenToOutput(j,i) += momentum*(hiddenToOutput(j,i) - _prev_hiddenToOutput(j,i)) +
                               teachingStep * outputDeltaAt(i) * hiddenAt(hiddenL,j);
    }
}

prWeights = tempWeights;
```

And finally this is the third and final pass over the network (for each image of course), which is a forward propagation from the input layer to the output layer. Here we use the previously calculated deltas to adjust the weights of the network, to make up for the error we found at the initial calculation. This is just an implementation in code of the weight adjustment equations we saw in the theoretical part of the tutorial.

We can see the teaching step at work here. Moreover the careful reader will have noticed that we keep the previous weight vector values in a temporary vector. That is because of the momentum. If you recall, we mentioned that the momentum adds a percentage of the already applied weight change to each subsequent weight change, achieving faster training speeds. Hence the term momentum.

Well, that’s actually all there is to know about training with the back-propagation algorithm and the multi-layer perceptron. Let’s take a look at the FileReader class.

```cpp
class FileReader
{
private:
    char* imgBuffer;
    //a DWORD
    char* check;
    bool firstImageRead;
    //the input filestream used to read
    std::ifstream fs;

    //image stuff
    int width;
    int height;
public:
    FileReader();
    ~FileReader();

    bool readBitmap(int fileNum);

    //reads the first bitmap file, the one designated with a '0'
    //and gets the dimensions. All other .bmp are assumed with
    //equal and identical dimensions
    int getBitmapDimensions();

    //returns a pointer to integers with all the goals
    //that each bitmap should have. Reads it from a file
    int* getImgGoals();

    //returns a pointer to the currently read data
    char* getImgData();

    //helper function converting bytes to an int
    int bytesToInt(char* bytes,int number);
};
```

This is the FileReader class. It contains imgBuffer, which holds the data of the currently read bitmap, and the input file stream used to read the bitmaps, and it also keeps the width and height of the first image. Seeing how the functions are implemented is out of the scope of this tutorial, but you can check the code in the .zip file to see how it is done. What you need to know is that this class will read the image designated as ‘img0.bmp’ and assume that all the other images are monochrome bitmaps with the same dimensions, and that all of them are located in the same path as the executable.
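The bitmap format stores multi-byte numbers such as the image width and height with the least significant byte first. A byte-to-int helper along the lines of `bytesToInt` could therefore look like the sketch below. This is a guess at the idea, not the version from the .zip file, and it takes a `const char*` where the class declaration uses `char*`.

```cpp
#include <cassert>
#include <cstdint>

// Assembles `count` bytes, least significant byte first, into one int.
int bytesToInt(const char* bytes, int count)
{
    assert(count >= 1 && count <= 4);
    std::uint32_t value = 0;
    for (int i = 0; i < count; ++i) {
        // Go through unsigned char so bytes above 0x7F are not sign-extended.
        const std::uint32_t byte = static_cast<unsigned char>(bytes[i]);
        value |= byte << (8 * i);
    }
    return static_cast<int>(value);
}
```

In the common BMP header layout the width is the 4-byte field at offset 18 of the file and the height is the 4-byte field at offset 22, so for an 8×8 image both fields hold the bytes 08 00 00 00.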

With any image editing program, even MS Windows Paint, you can create monochrome bitmaps. You can create your own bitmap images and save them in that format, but remember to use incrementing numbers to name the files and to update goals.txt accordingly. Moreover, all images should have the same dimensions.

Assuming you have the image bitmaps AND the goals.txt file in the same directory as the executable, you can run the tutorial as shown in the above image. It is using the cmd command line in Windows, but it should work fine in Linux too. If you call it incorrectly, you will be prompted with the correct usage.

At any time during training in Windows, or every 1000 epochs in Linux (for now; using the pdCurses library for this is on the TODO list), you can stop training and start recalling images. You are prompted for the image number, the one coming after ‘img’ in the file name, and the network recalls that image and tells you which number it thinks the image represents. Afterwards, as you can see in the image above, you also get some percentages showing how closely the network thinks the image matches each of the numbers from 0 to 9.

Well, this was it. I hope you enjoyed this tutorial and managed to comprehend the workings of the multi-layer perceptron neural network. You can find the images I used to train the network together with the tutorial’s source code. I used really small dimensions, 8×8, just so the network can be trained fast. If you stick with the parameters I used above, the network should converge. Since this network has many outputs, some of which look alike, the mean square error cannot go really low. That is because some numbers are almost the same (especially the way I painted them), specifically 7 and 4, and 0 and 8. Still, as far as picking the best matching pattern goes, the network performs brilliantly. For the least mean square error, feel free to stop training when it goes below 0.45 or so.

As always if you have any comments about the tutorial, constructive criticism or found any bugs in the code please email me at lefteris *at* refu *dot* co

Hi,

Thank you for your tutorial. That’s helpful. Though I did not understand how you compute f'(u). The delta for the output layer is

delta = (target – impulse) * f'(impulse),

where target is the desired output and impulse is the neuron value.

It seems that you use f'(impulse) = dersigmoid(impulse) = impulse * (1 – impulse),

The sigmoid is sigmoid(x) = 1 / (1 + exp(-x)). How do you get f'(u)?

maybe……?

f(x) = sigmoid(x)

= (1+exp[-x]) ^ (-1)

f'(x) = -(1+exp[-x]) ^ (-2) * -exp[-x]

= (1+exp[-x])^(-2) * exp[-x]

= { (1+exp[-x]) ^ (-1) * exp[-x] } * { (1+exp[-x]) ^ (-1) }

= { 1 – f(x) } * f(x)

so, derivSigmoid = sigmoid * (1 – sigmoid)

HI

thank you for your tutorial.

i put the goal.txt and bitmaps in the same directory as executable file

but it doesn’t work .

how can i fix the problem ?

thank you for your tutorial.

I wrote this tutorial a very long time ago. A lot of things may have changed since then, regarding compiler versions and tools. You have to try and figure it out on your own.