Multi-layer Perceptron Tutorial

This post continues the neural network tutorial series and is a direct continuation of the perceptron tutorial. We will see what a multi-layer perceptron neural network is, why it is so powerful, and how we can implement one. The source code of an implementation is also provided, along with a small toy application example.

Structure of a Multilayer perceptron

This tutorial continues from the last neural network tutorial, the Perceptron tutorial. We will now introduce the structure of the multi-layer perceptron and the back-propagation algorithm, without doubt the most popular neural network structure to date. If you are in a hurry and just want to mess with the code you can get it from here, but I would recommend reading on to see how the network functions.

Tutorial Prerequisites

  • The reader should be familiar with the perceptron neural network
  • The reader should have a basic understanding of C/C++
  • The reader should know how to compile and run a program from the command line in Windows or Linux

Tutorial goals

  • The reader will understand the structure of the Multi-layer Perceptron neural network
  • The reader will understand the back-propagation algorithm
  • The reader will know about the wide array of applications this network is used in
  • The reader will learn all the above via an actual practical application in optical character recognition

Tutorial Body

This network was introduced around 1986 with the advent of the back-propagation algorithm. Until then there was no rule via which we could train neural networks with more than one layer. As the name implies, a Multi-layer Perceptron is just that: a network composed of many neurons, divided into layers. These layers are divided as follows:

Structure of a Multilayer perceptron
  • The input layer, where the input of the network goes. The number of neurons here depends on the number of inputs we want our network to get
  • One or more hidden layers. These layers come between the input and the output and their number can vary. The function that the hidden layer serves is to encode the input and map it to the output. It has been proven that a multi-layer perceptron with only one hidden layer can approximate any function that connects its input with its outputs if such a function exists.
  • The output layer, where the outcome of the network can be seen. The number of neurons here depends on the problem we want the neural net to learn

The Multi-layer perceptron differs from the simple perceptron in many ways. The part that stays the same is the weight randomization: all weights are given random values within a certain range, usually [-0.5,0.5]. That aside, though, for each pattern that is fed to the network three passes over the net are made. Let's see them one by one in detail.

Calculating the output

In this phase we calculate the output of the network. For each layer, we calculate the firing value of each neuron by summing, over all neurons of the previous layer connected to it, the product of each such neuron's value and the corresponding weight. That sounded a little big, so here it is in pseudocode:

    value[neuron][layer] = 0;
    for(int i = 0; i < previousLayerNeurons; i++)
        value[neuron][layer] += weight(i,neuron) * value[i][layer-1];
    value[neuron][layer] = activationFunction(value[neuron][layer]);

As can be seen from the pseudocode, here too we have activation functions. They are used to normalize the output of each neuron, and the functions most commonly used with the perceptron apply here too. So, we gradually propagate forward in the network until we reach the output layer and produce some output values. Just like with the perceptron, these values are initially completely random and have nothing to do with our goal values. But this is where the back-propagation learning algorithm kicks in.

Back propagation

The back-propagation learning algorithm uses the delta rule. It computes the deltas (local gradients) of each neuron, starting from the output neurons and going backwards until it reaches the input layer. To compute the deltas of the output neurons, though, we first have to get the error of each output neuron. That's pretty simple: the multi-layer perceptron is trained in a supervised manner, so the error is the difference between the network's output and the desired output.

e_j(n) = d_j(n) - o_j(n)

where e(n) is the error vector, d(n) is the desired output vector and o(n) is the actual output vector. Now to compute the deltas:

delta_j^(L)(n) = e_j(n) * f'(u_j^(L)(n)) , for neuron j in the output layer L

where f'(u_j^(L)(n)) is the derivative of the activation function evaluated at the value u of the jth neuron of the output layer L

delta_j^(l)(n) = f'(u_j^(l)(n)) * Σ_k [ delta_k^(l+1)(n) * w_kj^(l+1)(n) ] , for neuron j in hidden layer l

where f'(u_j^(l)(n)) is again the derivative for the jth neuron in layer l, and inside the sum we have the products of all the deltas of the neurons of the next layer multiplied by their corresponding weights.

This is a very important part of the delta rule and the whole essence of back propagation. Why, you might ask? Because, as high school math teaches us, a derivative measures how much a function changes as its input changes. By propagating the derivatives backwards, we are informing all the neurons in the previous layers of the change that is needed in our weights to match the desired output. And all that starts from the initial error calculation at the output layer. Just like magic!

Weight adjustment

Having calculated the deltas for all the neurons we are now ready for the third and final pass of the network, this time to adjust the weights according to the generalized delta rule:

w_ji^(l)(n+1) = w_ji^(l)(n) + α * [ w_ji^(l)(n) - w_ji^(l)(n-1) ] + η * delta_j^(l)(n) * y_i^(l-1)(n)

Do not be discouraged by the mathematical mumbo jumbo. It is actually quite simple. What the above says is:

The new weights for layer l are calculated by adding two things to the current weights. The first is the difference between the current weights and the previous weights, multiplied by the coefficient we symbolize with α. This coefficient is called the momentum coefficient, and true to its name it adds speed to the training of any multi-layer perceptron by adding part of the already occurred weight changes to the current weight change. This is a double-edged sword, though: if your momentum coefficient is too large the network will not converge and it will probably get stuck in a local minimum.

The other term added to the weight change is the delta of the layer whose weights we change (l), multiplied by the outputs of the neurons of the previous layer (l-1), all multiplied by the constant η, which we know to be the teaching step from the previous tutorial about the perceptron. And that is basically it! That's what the multi-layer perceptron is all about. It is no doubt a very powerful neural network and a very powerful tool in statistical analysis.

Practical Example

It would not be a tutorial if we just explained how it works and gave you the equations. As was already mentioned, the Multi-layer perceptron has many applications: statistical analysis, pattern recognition and optical character recognition are just some of them. Our example will focus on a simple instance of optical character recognition. Specifically, the final program will be able to use an MLP to differentiate between a number of .bmp monochrome bitmap files and tell us which number each image depicts. I used 8×8 pixel resolution for the images, but the reader is free to use other resolutions and/or monochrome images, since the program will read the size from the bitmap itself. Below you can see an example of such bitmaps.


They are ugly, right? Differentiating between them should be hard for a computer, right? This ugliness can be considered noise, and MLPs are really good at telling noise apart from the actual data that helps them reach a conclusion. But let's go on and see some code to understand how it is done.

    class MLP
    {
        std::vector<float> inputNeurons;
        std::vector<std::vector<float> > hiddenNeurons;
        std::vector<float> outputNeurons;
        std::vector<float> weights;

        FileReader* reader;

        int inputN, outputN, hiddenN, hiddenL;

    public:
        MLP(int hiddenL, int hiddenN);
        //assigns values to the input neurons
        bool populateInput(int fileNum);
        //calculates the whole network, from input to output
        void calculateNetwork();
        //trains the network according to our parameters
        bool trainNetwork(float teachingStep, float lmse, float momentum, int trainingFiles);
        //recalls the network for a given bitmap file
        void recallNetwork(int fileNum);
    };

The above is our multi-layer perceptron class. As you can see, it has vectors for all the neurons and their connection weights. It also contains a FileReader object; as we will see below, FileReader is a class we will write to read the bitmap files and populate our input. The functions of the MLP are similar to the perceptron's: it populates its input by reading the bitmap images, calculates the output of the network and trains the network. Moreover, you can recall the network for a given 'fileNum' image to see what number the network thinks the image represents.

    //Multi-layer perceptron constructor
    MLP::MLP(int hL, int hN)
    {
        outputN = 10; //the ten possible digits, 0 through 9
        hiddenL = hL;
        hiddenN = hN;
        //initialize the filereader
        reader = new FileReader();
        //read the first image to see what kind of input our net will have
        inputN = reader->getBitmapDimensions();
        if(inputN == -1)
        {
            printf("There was an error detecting img0.bmp\n\r");
            return;
        }
        //let's allocate the memory for the weights
        //also let's set the size for the neurons vector
        //randomize weights for inputs to 1st hidden layer
        for(int i = 0; i < inputN*hiddenN; i++)
            weights.push_back( ((float)rand() / ((float)RAND_MAX + 1.0f)) - 0.5f );//[-0.5,0.5]
        //if there is more than 1 hidden layer, randomize the weights between them
        for(int i = 1; i < hiddenL; i++)
            for(int j = 0; j < hiddenN*hiddenN; j++)
                weights.push_back( ((float)rand() / ((float)RAND_MAX + 1.0f)) - 0.5f );//[-0.5,0.5]
        //and finally randomize the weights for the output layer
        for(int i = 0; i < hiddenN*outputN; i++)
            weights.push_back( ((float)rand() / ((float)RAND_MAX + 1.0f)) - 0.5f );//[-0.5,0.5]
    }

The network takes the number of hidden neurons and hidden layers as parameters, so it knows how to initialize its neurons and weights vectors. Moreover, it reads the first bitmap, 'img0.bmp', to get the dimensions that all the images will have, as can be seen from this line:

inputN = reader->getBitmapDimensions();

That is a requirement of our tutorial's program. You are free to provide any bitmap size you want for the first image, 'img0.bmp', but all the following images are required to be of the same size. As in most neural networks, the weights are initialized in the range [-0.5,0.5].

    void MLP::calculateNetwork()
    {
        //let's propagate towards the hidden layer
        for(int hidden = 0; hidden < hiddenN; hidden++)
        {
            hiddenAt(1,hidden) = 0;
            for(int input = 0; input < inputN; input++)
                hiddenAt(1,hidden) += inputNeurons[input] * inputToHidden(input,hidden);
            //and finally pass it through the activation function
            hiddenAt(1,hidden) = sigmoid(hiddenAt(1,hidden));
        }
        //now if we got more than one hidden layers
        for(int i = 2; i <= hiddenL; i++)
        {
            //for each one of these extra layers calculate their values
            for(int j = 0; j < hiddenN; j++)//to
            {
                hiddenAt(i,j) = 0;
                for(int k = 0; k < hiddenN; k++)//from
                    hiddenAt(i,j) += hiddenAt(i-1,k) * hiddenToHidden(i,k,j);
                //and finally pass it through the activation function
                hiddenAt(i,j) = sigmoid(hiddenAt(i,j));
            }
        }
        //and now hidden to output
        for(int i = 0; i < outputN; i++)
        {
            outputNeurons[i] = 0;
            for(int j = 0; j < hiddenN; j++)
                outputNeurons[i] += hiddenAt(hiddenL,j) * hiddenToOutput(j,i);
            //and finally pass it through the activation function
            outputNeurons[i] = sigmoid(outputNeurons[i]);
        }
    }

The calculateNetwork function just finds the output of the network that corresponds to the currently given input. It propagates the input signals through each layer until they reach the output layer. There is nothing really special in the above code; it is just an implementation of the equations presented earlier. As we saw in the constructor, the neural network of our tutorial has 10 different outputs. Each of these outputs represents the possibility that the input pattern is a certain number. So, output 1 being close to 1.0 would mean that the input pattern is most probably a 1, and so on…

The training function is too big to post in its entirety here, so I recommend you take a look at the .zip with the source code to see it in full. We will just focus on the implementation of the back-propagation algorithm.

    for(int i = 0; i < outputN; i++)
    {
        //let's get the delta of the output layer
        //and the accumulated error
        if(i != target)
        {
            outputDeltaAt(i) = (0.0 - outputNeurons[i]) * dersigmoid(outputNeurons[i]);
            error += (0.0 - outputNeurons[i]) * (0.0 - outputNeurons[i]);
        }
        else
        {
            outputDeltaAt(i) = (1.0 - outputNeurons[i]) * dersigmoid(outputNeurons[i]);
            error += (1.0 - outputNeurons[i]) * (1.0 - outputNeurons[i]);
        }
    }
    //we start propagating backwards now, to get the error of each neuron
    //in every layer
    //let's get the delta of the last hidden layer first
    for(int i = 0; i < hiddenN; i++)
    {
        hiddenDeltaAt(hiddenL,i) = 0;//zero the values from the previous iteration
        //add to the delta for each connection with an output neuron
        for(int j = 0; j < outputN; j++)
            hiddenDeltaAt(hiddenL,i) += outputDeltaAt(j) * hiddenToOutput(i,j);
        //The derivative here is only because of the
        //delta rule weight adjustment about to follow
        hiddenDeltaAt(hiddenL,i) *= dersigmoid(hiddenAt(hiddenL,i));
    }
    //now for each additional hidden layer, provided they exist
    for(int i = hiddenL-1; i > 0; i--)
    {
        //add to each neuron's hidden delta
        for(int j = 0; j < hiddenN; j++)//from
        {
            hiddenDeltaAt(i,j) = 0;//zero the values from the previous iteration
            for(int k = 0; k < hiddenN; k++)//to
            {
                //the deltas of the next layer multiplied by the weights
                //for each neuron
                hiddenDeltaAt(i,j) += hiddenDeltaAt(i+1,k) * hiddenToHidden(i+1,j,k);
            }
            //The derivative here is only because of the
            //delta rule weight adjustment about to follow
            hiddenDeltaAt(i,j) *= dersigmoid(hiddenAt(i,j));
        }
    }

As you can see, this is the second pass over the network, the so-called back-propagation, since this time we are going backwards. Having calculated the output and knowing the desired output (called target in the code above), we start the delta calculation according to the equations we saw at the start of the tutorial. If you don't like math, then here it is for you in code. As you can see, many helper macros are used to differentiate between the weights and deltas of different layers.

    //Weights modification
    tempWeights = weights;//keep the previous weights somewhere, we will need them
    //input to hidden weights
    for(int i = 0; i < inputN; i++)
        for(int j = 0; j < hiddenN; j++)
            inputToHidden(i,j) += momentum*(inputToHidden(i,j) - _prev_inputToHidden(i,j)) +
                                  teachingStep * hiddenDeltaAt(1,j) * inputNeurons[i];
    //hidden to hidden weights, provided more than 1 layer exists
    for(int i = 2; i <= hiddenL; i++)
        for(int j = 0; j < hiddenN; j++)//from
            for(int k = 0; k < hiddenN; k++)//to
                hiddenToHidden(i,j,k) += momentum*(hiddenToHidden(i,j,k) - _prev_hiddenToHidden(i,j,k)) +
                                         teachingStep * hiddenDeltaAt(i,k) * hiddenAt(i-1,j);
    //last hidden layer to output weights
    for(int i = 0; i < outputN; i++)
        for(int j = 0; j < hiddenN; j++)
            hiddenToOutput(j,i) += momentum*(hiddenToOutput(j,i) - _prev_hiddenToOutput(j,i)) +
                                   teachingStep * outputDeltaAt(i) * hiddenAt(hiddenL,j);
    prWeights = tempWeights;

And finally, this is the third and final pass over the network (for each image, of course), going forward again from the input layer to the output layer. Here we use the previously calculated deltas to adjust the weights of the network, to make up for the error we found at the initial calculation. This is just an implementation in code of the weight adjustment equation we saw in the theoretical part of the tutorial.

We can see the teaching step at work here. Moreover the careful reader will have noticed that we keep the previous weight vector values in a temporary vector. That is because of the momentum. If you recall, we mentioned that the momentum adds a percentage of the already applied weight change to each subsequent weight change, achieving faster training speeds. Hence the term momentum.

Well, that's actually all there is to know about back-propagation training and the multi-layer perceptron. Let's take a look at the FileReader class.

    class FileReader
    {
        char* imgBuffer;
        //a DWORD
        char* check;
        bool firstImageRead;
        //the input filestream used to read
        ifstream fs;
        //image stuff
        int width;
        int height;

    public:
        bool readBitmap(int fileNum);
        //reads the first bitmap file, the one designated with a '0',
        //and gets the dimensions. All other .bmp files are assumed
        //to have identical dimensions
        int getBitmapDimensions();
        //returns a pointer to integers with all the goals
        //that each bitmap should have. Reads it from a file
        int* getImgGoals();
        //returns a pointer to the currently read data
        char* getImgData();
        //helper function converting bytes to an int
        int bytesToInt(char* bytes, int number);
    };

This is the FileReader class. It contains the imgBuffer, which holds the data of the currently read bitmap, and the input file stream used to read the bitmaps; it also keeps the width and height of the initializer image. How these functions are implemented is out of the scope of this tutorial, but you can check the code in the .zip file to see how it is done. What you need to know is that this class will read the image designated as 'img0.bmp' and assume that all the other images are monochrome bitmaps with the same dimensions, and that all are located in the same path as the executable.

Using any image editing program, even MS Windows Paint, you can create monochrome bitmaps. You can make your own bitmap images and save them like that; just remember to use incrementing numbers to name the files and to update goals.txt accordingly. Moreover, all images should have the same dimensions.

How to use the executable

Assuming you have the image bitmaps AND the goals.txt file in the same directory as the executable, you can run the tutorial as shown in the image above. The image shows the cmd command line in Windows, but it should work fine in Linux too. If you call it incorrectly you will be prompted with the correct usage.

Recalling the mlp

At any time during training in Windows, and every 1000 epochs in Linux (for now; using the PDCurses library is on the TODO list), you can stop and start recalling images. You are prompted for the image number, the one coming after 'img' in the file name, and the network recalls that image and tells you what it thinks it represents. Afterwards, as you can see from the image above, you also get some percentages showing how much the network thinks the image matches each number from 0 to 9.

Well, this was it. I hope you enjoyed this tutorial and managed to comprehend the workings of the multi-layer perceptron neural network. You can find the source code and the images I used to train the network in the tutorial's source code. I used really small dimensions, 8×8, just so the network could be trained fast. If you stick with the parameters I used above you are sure to converge. Since this network has many outputs, some of which look alike, the mean square error cannot go really low. That is because some numbers look almost the same, especially the way I painted them: specifically 7 with 4, and 0 with 8. Still, as far as picking the best matching pattern goes, the network performs brilliantly. As for the least mean square error, feel free to stop training when it goes below 0.45 or so.

As always if you have any comments about the tutorial, constructive criticism or found any bugs in the code please email me at lefteris *at* refu *dot* co

4 thoughts on “Multi-layer Perceptron Tutorial”

  1. Hi,
    Thank you for your tutorial. That’s helpful. Though I did not understand how you compute f'(u). The delta for the output layer is
    delta = (target - impulse) * f'(impulse),
    where target is the desired output and impulse is the neuron value.
    It seems that you use f'(impulse) = dersigmoid(impulse) = impulse * (1 – impulse),
    The sigmoid is sigmoid(x) = 1 / (1 + exp(-x)). How do you get f'(u)?

  2. maybe……?

    f(x) = sigmoid(x)
    = (1+exp[-x]) ^ (-1)

    f'(x) = -(1+exp[-x]) ^ (-2) * -exp[-x]
    = (1+exp[-x])^(-2) * exp[-x]
    = { (1+exp[-x]) ^ (-1) * exp[-x] } * { (1+exp[-x]) ^ (-1) }
    = { 1 - f(x) } * f(x)

    so, derivSigmoid = sigmoid * (1 – sigmoid)

  3. HI
    thank you for your tutorial.
    i put the goal.txt and bitmaps in the same directory as executable file
    but it doesn’t work .
    how can i fix the problem ?

    thank you for your tutorial.

  4. I wrote this tutorial a very long time ago. A lot of things may have changed since then, regarding compiler versions and tools. You have to try and figure it out on your own.
