
Understand and Implement the Backpropagation Algorithm From Scratch In Python

It's essential to have a clear understanding of how to implement a simple Neural Network from scratch. In this tutorial we will go step by step through understanding and implementing a Neural Network. We'll begin with Linear Regression and use the same concept to build a 2-Layer Neural Network. Then we'll code an N-Layer Neural Network from scratch in Python. As a prerequisite, you should have a basic understanding of Linear/Logistic Regression with Gradient Descent.

Let's see how we can gradually move toward building our first neural network.

Here we've represented Linear Regression in a graphical format (the bias b is not shown). As you see in the diagram below, we have two input features ( \( x_1, x_2 \) ). Z represents the linear combination of the inputs and the weights w. The node with Z can also be called a hidden unit, since X & Y are visible (used for training) and Z is something defined inside the model.

We can write the equation for predicting values using the above linear regression as (this is shown using the blue arrow),

\[ \hat{y} = z = b + x_1w_1 + x_2w_2 \]

So in order to find the best w, we need to first define the cost function J. To use gradient descent, we take the derivative of the cost function J w.r.t. w and b, then update w and b by a fraction (the learning rate) of dw and db until convergence (this is shown using the red arrow).
We can write dw and db as follows (using the chain rule).


And the gradient descent equations for updating w and b are,

\begin{align}
W &:= W - \alpha \frac{dJ}{dW} \\
b &:= b - \alpha \frac{dJ}{db}
\end{align}

In summary: first we predict \( \hat{y} \), then using this we calculate the cost, and after that, using gradient descent, we adjust the parameters of the model. This happens in a loop, and eventually we learn the best parameters (w and b) to use for prediction. The picture above depicts the same.
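As a concrete illustration of this predict–cost–update loop, here is a minimal numpy sketch of linear regression trained with gradient descent (the data, learning rate and iteration count here are made up for the example):

```python
import numpy as np

np.random.seed(0)
n = 100
X = np.random.rand(n, 2)                  # two input features x1, x2
true_w, true_b = np.array([2.0, -3.0]), 0.5
y = X.dot(true_w) + true_b                # targets from a known linear model

w, b, alpha = np.zeros(2), 0.0, 0.5
for _ in range(2000):
    y_hat = X.dot(w) + b                  # forward: y_hat = b + x1*w1 + x2*w2
    dw = X.T.dot(y_hat - y) / n           # dJ/dw for J = mean squared error / 2
    db = np.mean(y_hat - y)               # dJ/db
    w -= alpha * dw                       # W := W - alpha * dJ/dW
    b -= alpha * db                       # b := b - alpha * dJ/db

print(np.round(w, 2), round(b, 2))        # approaches [2, -3] and 0.5
```

Since the targets come from an exactly linear model, the loop recovers the true parameters almost perfectly.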

Here we'll try to represent Logistic Regression in the same way. Mathematically, Logistic Regression differs from Linear Regression in the following two ways:

  • Logistic Regression has a different Cost Function J
  • It applies a non-linear transformation (Sigmoid) on Z to predict the probability of the class label (Binary Classification)

As you see in the diagram below, the blue arrow indicates Forward Propagation.

Here are the steps of Forward Propagation in Logistic Regression (in matrix format):

\begin{align}
Z &= WX+b \\
\hat{y} &= A = \sigma(Z)
\end{align}

Gradient Descent (a.k.a. Backpropagation) in Logistic Regression has an additional derivative to calculate.


The gradient descent equations for updating w and b will be exactly the same as in Linear Regression (they're the same for a Neural Network too),

\begin{align}
W &:= W - \alpha \frac{dJ}{dW} \\
b &:= b - \alpha \frac{dJ}{db}
\end{align}

The process flow diagram is exactly the same for Logistic Regression too.

We can say that Logistic Regression is a 1-Layer Neural Network. Now we'll extend the concept to a 2-Layer Neural Network.

Let's extend the same idea to a 2-Layer Neural Network. Refer to the diagram below (the bias term is not displayed). There are some minor notation changes; for instance, the superscript now denotes the layer number. We have added two more hidden units to our model. The matrix W will have different dimensions for each hidden layer.

In case you're new to Neural Networks, think of the output of the first layer as being used as the input to the next layer. Earlier, in the case of Logistic Regression, we didn't have multiple layers. These intermediate hidden layers provide a way to solve complex tasks (a.k.a. non-linearity).

We can write the forward propagation in two steps per layer as (consider the uppercase letters as matrices):

\begin{align}
Z^{[1]} &= W^{[1]}X+b^{[1]} \\
A^{[1]} &= \sigma(Z^{[1]}) \\
Z^{[2]} &= W^{[2]}A^{[1]}+b^{[2]} \\
\hat{y} &= A^{[2]} = \sigma(Z^{[2]})
\end{align}
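The four equations above translate almost line for line into numpy. Below is a small sketch with made-up layer sizes, where each column of X is one example:

```python
import numpy as np

def sigma(Z):
    return 1 / (1 + np.exp(-Z))

np.random.seed(1)
n_x, n_1, n_2, m = 4, 3, 1, 5             # input features, layer sizes, examples
X = np.random.randn(n_x, m)               # each column is one example
W1, b1 = np.random.randn(n_1, n_x), np.zeros((n_1, 1))
W2, b2 = np.random.randn(n_2, n_1), np.zeros((n_2, 1))

Z1 = W1.dot(X) + b1                       # Z[1] = W[1]X + b[1]
A1 = sigma(Z1)                            # A[1] = sigma(Z[1])
Z2 = W2.dot(A1) + b2                      # Z[2] = W[2]A[1] + b[2]
y_hat = sigma(Z2)                         # y_hat = A[2] = sigma(Z[2])

print(y_hat.shape)                        # (1, 5): one prediction per example
```

Every prediction lands in (0, 1), as expected from the final sigmoid.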

Again, just like in Linear and Logistic Regression, gradient descent can be used to find the best W and b. The approach is basically the same:

Define a cost function J.

Take the derivatives (dw, db) of the cost function J w.r.t. w and b.
Update w and b using dw, db.

Backpropagation is shown in the above diagram using the red arrows. Let's find dw and db using the chain rule. This may look complicated, but if you just follow the arrows you can easily correlate them with the equations.

\begin{align}
dW^{[2]} &= \frac{dJ}{dW^{[2]}}=\frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dW^{[2]}} \\
db^{[2]} &= \frac{dJ}{db^{[2]}}=\frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{db^{[2]}} \\
dW^{[1]} &= \frac{dJ}{dW^{[1]}}=\frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}}\frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{dW^{[1]}} \\
db^{[1]} &= \frac{dJ}{db^{[1]}}=\frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}}\frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{db^{[1]}}
\end{align}

Finally, we'll update w and b as follows (same as in the other algorithms),

\begin{align}
W^{[1]} &:= W^{[1]}-\alpha \frac{dJ}{dW^{[1]}} \\
b^{[1]} &:= b^{[1]}-\alpha \frac{dJ}{db^{[1]}} \\
W^{[2]} &:= W^{[2]}-\alpha \frac{dJ}{dW^{[2]}} \\
b^{[2]} &:= b^{[2]}-\alpha \frac{dJ}{db^{[2]}}
\end{align}

As you can see, technically the steps are the same for Linear Regression, Logistic Regression and a Neural Network.

In an Artificial Neural Network, the steps in the direction of the blue arrows are known as Forward Propagation, and the steps in the direction of the red arrows as Back-Propagation.


One major disadvantage of Backpropagation is computational complexity. Just for a 2-layer Neural Network with 2 hidden units in layer one, we already have a fairly complicated equation to solve. Imagine the computational complexity for a network with hundreds of layers and thousands of hidden units in each layer. To solve this problem we can use dynamic programming.

The high-level idea is to express the derivative \( dW^{[l]} \) (where l is the current layer) using the already-calculated values ( \( dA^{[l+1]}, dZ^{[l+1]} \) etc.) of layer l+1. In a nutshell, this is known as the Backpropagation Algorithm.

We'll derive the Backpropagation algorithm for a 2-Layer Network and then generalize it to an N-Layer Network.

Derivation for the 2-Layer Neural Network:

For simplicity's sake, let's assume our 2-Layer Network only does binary classification. The final layer will then use a Sigmoid activation function, and our cost function will simply be the Binary Cross-Entropy Error Function used in Logistic Regression. The activation function of the remaining hidden layer can be anything.

Why the above assumptions are necessary:

Since Backpropagation starts by taking the derivative of the cost/error function, the derivation will be different if we use a different activation function such as Softmax (at the last layer only). Softmax can be used for Multi-Class Classification; I'll have a separate post for that.

I'll be referring to the diagram above, which I drew to show the Forward and Backward Propagation of the 2-Layer Network. So that you don't have to scroll up and down, I'm including the same diagram here again.

Our first objective is to find \( \frac{dJ}{dW^{[2]}} \), where J is the cost function and \( W^{[2]} \) is the matrix of all the weights in the final layer. Using partial derivatives we can define the following (follow the red path of the Backpropagation in the image above if you get confused):

\[ \frac{dJ}{dW^{[2]}} = \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dW^{[2]}} \]

Our Cross-Entropy Error Function for binary classification is:

\[ J = -\frac{1}{n} \bigg( Y\log \left ( A^{[2]} \right ) + \left ( 1-Y \right )\log \left ( 1 - A^{[2]} \right ) \bigg) \]

Remember, in the above equation \( A^{[2]} \) is nothing but \( \hat{y} \).

Now we can define \( \frac{dJ}{dW^{[2]}} \) as,

\[ \frac{dJ}{dW^{[2]}} = \Bigg[ -\frac{Y}{A^{[2]}} + \frac{1-Y}{1- A^{[2]}} \Bigg]\Bigg[ A^{[2]} \left ( 1- A^{[2]} \right ) \Bigg]\Bigg[ A^{[1]} \Bigg] \]

Let's take a minute to understand what just happened here. The 1st part is the derivative of the Cost Function. As long as you know the derivative of log, you can see how this makes sense. (I've omitted the 1/n factor here; we'll ignore it for now, but during coding we'll make sure to divide the result by n.)

The 2nd part is the derivative of the Sigmoid activation function. Again, you can derive it yourself just by knowing the derivative of \( e^x \) w.r.t. x.

We already know \( Z^{[2]} \) from our forward propagation,
\[ Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \]

The derivative of the above \( Z^{[2]} \) w.r.t. \( W^{[2]} \) will simply be \( A^{[1]} \).

Simplifying the equation, we get

\begin{align}
\frac{dJ}{dW^{[2]}} &= \Bigg[ -Y + \cancel{YA^{[2]}} + A^{[2]} - \cancel{YA^{[2]}} \Bigg]\Bigg[ A^{[1]} \Bigg] \\
&= \Bigg[ A^{[2]} - Y \Bigg]\Bigg[ A^{[1]} \Bigg] \\
&= dZ^{[2]} A^{[1]}
\end{align}

Just note that (we'll use this later),
\[ dZ^{[2]} = \frac{dJ}{dZ^{[2]}} = \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}} = A^{[2]} - Y \]

Similarly, we can define \( \frac{dJ}{db^{[2]}} \) as,

\begin{align}
\frac{dJ}{db^{[2]}} &= \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{db^{[2]}} \\
&= \Bigg[ A^{[2]} - Y \Bigg]\Bigg[ 1 \Bigg] \\
&= dZ^{[2]}
\end{align}

We'll now move to the first layer (following the red arrows in the image),

\begin{align}
\frac{dJ}{dW^{[1]}} &= \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}} \frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{dW^{[1]}} \\
&= \frac{dJ}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}} \frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{dW^{[1]}} \\
&= \Bigg[ A^{[2]} - Y \Bigg]\Bigg[ W^{[2]} \Bigg]\Bigg[ g'\left( Z^{[1]} \right) \Bigg]\Bigg[ A^{[0]} \Bigg] \\
&= dZ^{[2]} W^{[2]} g'\left ( Z^{[1]} \right ) A^{[0]} \\
&= dZ^{[1]} A^{[0]}
\end{align}

There are a few points to note.

  • First is reusability: the whole goal of dynamic programming is to reuse already-computed values in future computations. That's why we're reusing \( dZ^{[2]} \) here.
  • \( A^{[0]} \) here is nothing but our input X; however, if you have more than 2 hidden layers, it will simply be the activation output of the previous layer.
  • We can generalize this equation for any layer other than the final layer (the final layer's equation depends on the activation function of that layer).

We also need the following,
\begin{align}
\frac{dJ}{dA^{[1]}} &= \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}} \\
&= \frac{dJ}{dZ^{[2]}}W^{[2]} \\
&= dZ^{[2]}W^{[2]}
\end{align}

Similarly for \( db^{[1]} \),

\begin{align}
\frac{dJ}{db^{[1]}} &= \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}} \frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{db^{[1]}} \\
&= \frac{dJ}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}} \frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{db^{[1]}} \\
&= \Bigg[ A^{[2]} - Y \Bigg]\Bigg[ W^{[2]} \Bigg]\Bigg[ g'\left( Z^{[1]} \right) \Bigg]\Bigg[ 1 \Bigg] \\
&= dZ^{[2]} W^{[2]} g'\left ( Z^{[1]} \right ) \\
&= dZ^{[1]}
\end{align}

Since we now have the required derivatives \( dW^{[2]}, db^{[2]}, dW^{[1]}, db^{[1]} \), it's time to define the full algorithm.
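Putting the derived formulas together, a numpy sketch of the 2-layer backward pass might look like this (synthetic shapes, sigmoid assumed as the hidden activation g, and the 1/n factor included as discussed):

```python
import numpy as np

def sigma(Z):
    return 1 / (1 + np.exp(-Z))

np.random.seed(2)
n_x, n_1, m = 4, 3, 6
A0 = np.random.randn(n_x, m)              # A[0] = X in column-per-example layout
Y = np.random.randint(0, 2, (1, m))
W1, b1 = np.random.randn(n_1, n_x), np.zeros((n_1, 1))
W2, b2 = np.random.randn(1, n_1), np.zeros((1, 1))

# forward pass (needed to cache the Z and A values)
Z1 = W1.dot(A0) + b1
A1 = sigma(Z1)
Z2 = W2.dot(A1) + b2
A2 = sigma(Z2)

# backward pass, reusing dZ2 exactly as derived above
dZ2 = A2 - Y                              # dZ[2] = A[2] - Y
dW2 = dZ2.dot(A1.T) / m                   # dW[2] = dZ[2] A[1]
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = W2.T.dot(dZ2) * (A1 * (1 - A1))     # dZ[1] = dZ[2] W[2] g'(Z[1])
dW1 = dZ1.dot(A0.T) / m                   # dW[1] = dZ[1] A[0]
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

print(dW2.shape, dW1.shape)               # (1, 3) (3, 4)
```

Note that each dW has exactly the same shape as its W, which is what makes the update step possible.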

N-Layer Neural Network Algorithm:

We'll now define the full algorithm of an N-Layer Neural Network by generalizing the equations we've derived for our 2-Layer Network.

  • Initialize \( W^{[1]} \dots W^{[L]}, b^{[1]} \dots b^{[L]} \)
  • Set \( A^{[0]} = X \) (Input), L = Total Layers
  • Loop epoch = 1 to max iteration
      • Forward Propagation
          • Loop l = 1 to L-1
              • \( Z^{[l]} = W^{[l]}A^{[l-1]}+b^{[l]} \)
              • \( A^{[l]} = g\left ( Z^{[l]} \right ) \)
              • Save \( A^{[l]}, W^{[l]}, Z^{[l]} \) in memory for later use
          • \( Z^{[L]} = W^{[L]}A^{[L-1]}+b^{[L]} \)
          • \( A^{[L]} = \sigma\left ( Z^{[L]} \right ) \)
      • Cost \( J = -\frac{1}{n} \bigg( Y\log \left ( A^{[L]} \right ) + \left ( 1-Y \right )\log \left ( 1 - A^{[L]} \right ) \bigg) \)
      • Backward Propagation
          • \( dA^{[L]} = -\frac{Y}{A^{[L]}} + \frac{1-Y}{1- A^{[L]}} \)
          • \( dZ^{[L]} = dA^{[L]} \sigma'\left ( Z^{[L]} \right ) \)
          • \( dW^{[L]} = dZ^{[L]} A^{[L-1]} \)
          • \( db^{[L]} = dZ^{[L]} \)
          • \( dA^{[L-1]} = dZ^{[L]} W^{[L]} \)
          • Loop l = L-1 to 1
              • \( dZ^{[l]} = dA^{[l]} g'\left ( Z^{[l]} \right ) \)
              • \( dW^{[l]} = dZ^{[l]} A^{[l-1]} \)
              • \( db^{[l]} = dZ^{[l]} \)
              • \( dA^{[l-1]} = dZ^{[l]} W^{[l]} \)
      • Update W and b
          • Loop l = 1 to L
              • \( W^{[l]} = W^{[l]} - \alpha \cdot dW^{[l]} \)
              • \( b^{[l]} = b^{[l]} - \alpha \cdot db^{[l]} \)

The algorithm above is easy to understand: it's simply the generalized version of our previous derivation. Feel free to ask me questions in the comments section if you have any doubts.

Python Implementation:

At this point we could technically jump straight into the code; however, you'll almost certainly run into issues with matrix dimensions. So let's make sure we fully understand the matrix dimensions before coding. Once you do, the coding will be very simple.

We'll use the MNIST dataset for our implementation. (You can google this dataset if you haven't heard of it and want to know more.)
MNIST has 60,000 28×28 grayscale training images across 10 different classes; however, since we'll be focusing on binary classification here, we'll pick all the images with label 5 or 8 (11272 in total). We'll write a function which will return the data we need.

Each pixel will be a feature for us, so we'll first flatten each image into a 28×28 = 784-dimensional vector. The input dimension will therefore be 11272 × 784.

Our Neural Network will have 2 layers in total, so the architecture will be 784 (input layer) → 196 → 1.

Forward Propagation – Layer 1:

\begin{align}
X &= (11272, 784) \\
W^{[1]} &= (196, 784) \\
b^{[1]} &= (196, 1) \\
A^{[0]} &= X^T = (784, 11272) \\
Z^{[1]} &= W^{[1]}A^{[0]}+b^{[1]} \\
&= (196, 784) * (784, 11272) + (196, 1) \\
&= (196, 11272) + (196, 1) \\
&= (196, 11272) \\
A^{[1]} &= g\left ( Z^{[1]} \right ) = (196, 11272)
\end{align}

Forward Propagation – Layer 2:

\begin{align}
W^{[2]} &= (1, 196) \\
b^{[2]} &= (1, 1) \\
Z^{[2]} &= W^{[2]}A^{[1]}+b^{[2]} \\
&= (1, 196) * (196, 11272) + (1, 1) \\
&= (1, 11272) + (1, 1) \\
&= (1, 11272) \\
A^{[2]} &= \sigma\left ( Z^{[2]} \right ) = (1, 11272)
\end{align}

Backward Propagation – Layer 2:

\begin{align}
Y^T &= (1, 11272) \\
dA^{[2]} &= -\frac{Y^T}{A^{[2]}} + \frac{1-Y^T}{1- A^{[2]}} = (1, 11272) \\
dZ^{[2]} &= dA^{[2]} \sigma'(Z^{[2]}) \\
&= (1, 11272) * (1, 11272) = (1, 11272) \\
dW^{[2]} &= dZ^{[2]} \left ( A^{[1]} \right )^T \\
&= (1, 11272) * (11272, 196) = (1, 196) \\
db^{[2]} &= dZ^{[2]} = (1, 1) \\
dA^{[1]} &= \left ( W^{[2]} \right )^T dZ^{[2]} \\
&= (196, 1) * (1, 11272) = (196, 11272)
\end{align}

Backward Propagation – Layer 1:

\begin{align}
dZ^{[1]} &= dA^{[1]} g'(Z^{[1]}) \\
&= (196, 11272) * (196, 11272) = (196, 11272) \\
dW^{[1]} &= dZ^{[1]} \left ( A^{[0]} \right )^T \\
&= (196, 11272) * (11272, 784) = (196, 784) \\
db^{[1]} &= dZ^{[1]} = (196, 1)
\end{align}

Two important points:

  • I haven't fully explained the calculation for b above. We need to sum \( dZ^{[l]} \) across all the training examples so that the dimensions of \( b^{[l]} \) and \( db^{[l]} \) match. We'll use numpy's axis=1 and keepdims=True options for this.
  • We have completely ignored the divide-by-n factor (it was part of our cost function). So as a practice, whenever we calculate the derivative of W and b, we'll divide the result by n.
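Both points are easy to verify directly in numpy (a quick sanity check, not part of the model code):

```python
import numpy as np

W1 = np.zeros((196, 784))
A0 = np.zeros((784, 11272))
b1 = np.zeros((196, 1))
Z1 = W1.dot(A0) + b1                         # (196, 1) broadcasts over columns
print(Z1.shape)                              # (196, 11272)

dZ1 = np.zeros((196, 11272))
db1 = np.sum(dZ1, axis=1, keepdims=True)     # sum over examples keeps (196, 1)
print(db1.shape)                             # (196, 1)
```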

We'll be using a Python library to load the MNIST data; it just helps us focus on the algorithm. You can install it with a single pip command.

We'll create a class named ANN and define the following methods in it.

We'll get the data, preprocess it, and invoke our ANN class. Our main will look like this. We should also be able to choose the number of layers in our model: rather than fixing the number of layers, we'll pass the layer sizes as an array to our ANN class.

The get_binary_dataset() function above will provide the Train and Test data. The dimensions of the data will be as we've seen above. In the pre_process_data() function we'll simply normalize the data.
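A sketch of what those helpers might look like. The actual MNIST loading call depends on the library you install, so only the label-filtering and normalization logic is shown, written to take already-loaded arrays and exercised on a tiny synthetic stand-in (the function names follow the description above):

```python
import numpy as np

def get_binary_dataset(x, y):
    # keep only the images labeled 5 or 8, and relabel them as 0/1
    index = np.where((y == 5) | (y == 8))[0]
    return x[index], (y[index] == 8).astype(int)

def pre_process_data(train_x, test_x):
    # normalize pixel intensities from [0, 255] down to [0, 1]
    return train_x / 255., test_x / 255.

# tiny synthetic stand-in for the real MNIST arrays
x = np.arange(5 * 784).reshape(5, 784) % 256
y = np.array([5, 0, 8, 8, 3])
x_bin, y_bin = get_binary_dataset(x, y)
print(x_bin.shape, y_bin)                 # (3, 784) [0 1 1]
```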

Below is the constructor of the ANN class. The layer sizes will be passed in as an array. self.parameters will be a dictionary object where we keep all the W and b.
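A minimal sketch of that constructor (the attribute names follow the description in this post, but are otherwise assumptions):

```python
import numpy as np

class ANN:
    def __init__(self, layers_size):
        # layers_size, e.g. [196, 1], lists the units per layer;
        # fit() will later prepend the input size (784)
        self.layers_size = layers_size
        self.parameters = {}            # dictionary holding every W and b
        self.L = len(layers_size)       # number of weight layers
        self.n = 0                      # number of training examples
        self.costs = []

ann = ANN([196, 1])
print(ann.L)                            # 2
```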

The fit() function will first call initialize_parameters() to create all the necessary W and b for each layer. Then we'll run the training for n_iterations. Inside the loop we first call the forward() function, then calculate the cost and call the backward() function. Afterwards, we update the W and b of all the layers.

Since the W1 parameter needs the number of features in the training data, we'll insert that into the layers_size array before invoking initialize_parameters().

In the initialize_parameters() function we loop through the layers_size array and store the parameters in the self.parameters dictionary.
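The loop might look like the following, written as a standalone function for clarity (in the class it would read layers_size from self and store into self.parameters):

```python
import numpy as np

def initialize_parameters(layers_size):
    # layers_size includes the input size here, e.g. [784, 196, 1]
    np.random.seed(1)
    parameters = {}
    for l in range(1, len(layers_size)):
        # W[l] has shape (units in layer l, units in layer l-1)
        parameters["W" + str(l)] = np.random.randn(
            layers_size[l], layers_size[l - 1]) / np.sqrt(layers_size[l - 1])
        parameters["b" + str(l)] = np.zeros((layers_size[l], 1))
    return parameters

params = initialize_parameters([784, 196, 1])
print(params["W1"].shape, params["b2"].shape)   # (196, 784) (1, 1)
```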

Once you run the code, the self.parameters dictionary will contain a W and b entry for every layer, with the shapes we worked out above.

The forward() function is very easy to understand. Although we're using the Sigmoid activation function in all the layers, we'll keep the calculation for the last layer outside of the loop so that we can easily plug in a Softmax function there (Softmax is not covered in this tutorial).

We'll also create a new store dictionary object and keep the A, W and Z of each layer so that we can use them during backpropagation.
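A standalone sketch of that forward pass (the class version would read W and b from self.parameters; sigmoid is inlined here so the snippet runs on its own):

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward(X, parameters, L):
    # X: (examples, features); returns A[L] and a cache for backpropagation
    store = {"A0": X.T}
    A = X.T
    for l in range(1, L):                       # hidden layers
        Z = parameters["W" + str(l)].dot(A) + parameters["b" + str(l)]
        A = sigmoid(Z)
        store["A" + str(l)] = A
        store["W" + str(l)] = parameters["W" + str(l)]
        store["Z" + str(l)] = Z
    # last layer outside the loop, so a Softmax could be plugged in later
    Z = parameters["W" + str(L)].dot(A) + parameters["b" + str(L)]
    A = sigmoid(Z)
    store["A" + str(L)] = A
    store["W" + str(L)] = parameters["W" + str(L)]
    store["Z" + str(L)] = Z
    return A, store

# tiny usage check: 2 layers, zero weights -> every output is sigmoid(0) = 0.5
params = {"W1": np.zeros((3, 4)), "b1": np.zeros((3, 1)),
          "W2": np.zeros((1, 3)), "b2": np.zeros((1, 1))}
A, store = forward(np.zeros((5, 4)), params, L=2)
print(A.shape, A[0, 0])    # (1, 5) 0.5
```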

The value A returned by forward() is basically the \( \hat{y} \).

In the backward() function, just as in the derivation, we first calculate dA, dW, db for the L'th layer and then, in a loop, find all the derivatives for the remaining layers.

The code follows the same derivations we went through earlier. We keep all the derivatives in the derivatives dictionary and return that to the fit() function.

Here is the code for the sigmoid() and sigmoid_derivative() functions. In a later tutorial we'll see how to use ReLU and Softmax.
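A minimal version of those two helpers, written as standalone functions:

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_derivative(Z):
    # the derivative of sigmoid, expressed in terms of sigmoid itself
    s = sigmoid(Z)
    return s * (1 - s)

print(sigmoid(0.0), sigmoid_derivative(0.0))   # 0.5 0.25
```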

In the predict() function we will simply use the current W and b to compute the probability using the forward() function, then convert the probability to a predicted class, 0 or 1.
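The thresholding step can be sketched on its own (to_class_labels is a hypothetical helper name; in the class this logic would live inside predict(), fed by forward()):

```python
import numpy as np

def to_class_labels(probabilities, threshold=0.5):
    # convert sigmoid outputs to predicted classes 0 or 1
    return (probabilities > threshold).astype(int)

probs = np.array([[0.1, 0.7, 0.5, 0.95]])
print(to_class_labels(probs))    # [[0 1 0 1]]
```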

Let's take a look at the output. You'll get around 96% Train and Test Accuracy.

The cost gradually goes down as we run more iterations.

The best part of writing the code in a generic way is that we can easily try different layer sizes. Let's try the following:

Here is the result.

Naturally, with the same data, iteration count and learning rate, the bigger network performs worse than the smaller one. If you were expecting a different outcome, let me know in the comment section and we can discuss it.

Below is the full code of the ANN class:
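Since the original listing isn't reproduced here, below is a compact, self-contained sketch of the whole class assembled from the pieces described above (method and variable names follow this post; a small epsilon is added to guard the log/division against saturation, and it is exercised on synthetic data rather than MNIST — treat it as an illustration, not the exact original code):

```python
import numpy as np

class ANN:
    def __init__(self, layers_size):
        self.layers_size = layers_size  # e.g. [196, 1]; fit() prepends input size
        self.parameters = {}            # holds W1, b1, W2, b2, ...
        self.L = len(layers_size)       # number of weight layers
        self.n = 0                      # number of training examples
        self.costs = []

    def sigmoid(self, Z):
        return 1 / (1 + np.exp(-Z))

    def sigmoid_derivative(self, Z):
        s = self.sigmoid(Z)
        return s * (1 - s)

    def initialize_parameters(self):
        np.random.seed(1)
        for l in range(1, len(self.layers_size)):
            self.parameters["W" + str(l)] = np.random.randn(
                self.layers_size[l], self.layers_size[l - 1]) / np.sqrt(self.layers_size[l - 1])
            self.parameters["b" + str(l)] = np.zeros((self.layers_size[l], 1))

    def forward(self, X):
        store = {"A0": X.T}
        A = X.T
        for l in range(1, self.L + 1):  # sigmoid everywhere, so one uniform loop
            Z = self.parameters["W" + str(l)].dot(A) + self.parameters["b" + str(l)]
            A = self.sigmoid(Z)
            store["A" + str(l)], store["Z" + str(l)] = A, Z
        return A, store

    def backward(self, Y, store):
        derivatives = {}
        A = np.clip(store["A" + str(self.L)], 1e-8, 1 - 1e-8)  # avoid divide by 0
        dA = -np.divide(Y, A) + np.divide(1 - Y, 1 - A)        # dJ/dA[L]
        for l in range(self.L, 0, -1):
            dZ = dA * self.sigmoid_derivative(store["Z" + str(l)])
            derivatives["dW" + str(l)] = dZ.dot(store["A" + str(l - 1)].T) / self.n
            derivatives["db" + str(l)] = np.sum(dZ, axis=1, keepdims=True) / self.n
            dA = self.parameters["W" + str(l)].T.dot(dZ)       # becomes dA[l-1]
        return derivatives

    def fit(self, X, Y, learning_rate=0.1, n_iterations=1000):
        # X: (n, features), Y: (1, n)
        self.n = X.shape[0]
        self.layers_size.insert(0, X.shape[1])  # W1 needs the input feature count
        self.initialize_parameters()
        for _ in range(n_iterations):
            A, store = self.forward(X)
            cost = -np.mean(Y * np.log(A + 1e-8) + (1 - Y) * np.log(1 - A + 1e-8))
            derivatives = self.backward(Y, store)
            for l in range(1, self.L + 1):
                self.parameters["W" + str(l)] -= learning_rate * derivatives["dW" + str(l)]
                self.parameters["b" + str(l)] -= learning_rate * derivatives["db" + str(l)]
            self.costs.append(cost)

    def predict(self, X, Y):
        A, _ = self.forward(X)
        y_hat = (A > 0.5).astype(int)   # probability -> class 0/1
        return np.mean(y_hat == Y) * 100

# quick check on synthetic binary data (stands in for the MNIST 5-vs-8 subset)
np.random.seed(0)
X = np.random.rand(200, 4)
Y = (X[:, 0] + X[:, 1] > 1.0).astype(int).reshape(1, -1)
ann = ANN([8, 1])
ann.fit(X, Y, learning_rate=0.5, n_iterations=3000)
print("Train accuracy:", ann.predict(X, Y))
```

On the synthetic task the cost decreases steadily and training accuracy ends well above chance, mirroring the behavior described above for MNIST.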

You can access the full project here:

I hope this tutorial provides a detailed view of the backpropagation algorithm. Since backpropagation is the backbone of any Neural Network, it's essential to understand it in depth. We can make many optimizations from this point onward to improve accuracy, speed up computation, and so on. Next we'll see how to implement the same network using both TensorFlow and PyTorch.

Below are the articles on implementing the Neural Network using TensorFlow and PyTorch.

  1. Implement Neural Network using TensorFlow
  2. Implement Neural Network using PyTorch
