Backpropagation algorithm
The backpropagation algorithm is essential for training large neural networks
quickly. This article explains how the algorithm works.
Simple neural network
Consider a simple neural network with one input node, one output node,
and two hidden layers of two nodes each.
Nodes in neighboring layers are connected by weights w_ij, which are the network's
parameters.
Activation function
Each node has a total input x, an activation function f(x),
and an output y = f(x).
f(x) has to be a non-linear function; otherwise the neural network will only
be able to learn linear models.
A commonly used activation function is
the sigmoid function:
f(x) = 1 / (1 + e^(-x)).
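As a quick sketch, the sigmoid can be written in a few lines of Python (the function name is our own choice):

```python
import math

def sigmoid(x):
    """Sigmoid activation: f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))
```

The output is squashed into the interval (0, 1), with f(0) = 0.5; because f is non-linear, stacking layers of such nodes lets the network represent non-linear functions.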
Error function
The goal is to learn the weights of the network automatically from data such that the predicted output y_output
is close to the target y_target for all inputs x_input.
To measure how far we are from the goal, we use an error function E.
A commonly used error function is E(y_output, y_target) = (1/2)(y_output − y_target)^2.
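In Python, this error function is a one-liner (the name squared_error is our own choice):

```python
def squared_error(y_output, y_target):
    """E(y_output, y_target) = 1/2 * (y_output - y_target)^2."""
    return 0.5 * (y_output - y_target) ** 2
```

The factor 1/2 is a convenience: it cancels the 2 that appears when differentiating the square, giving the clean derivative y_output − y_target later on.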
Forward propagation
We begin by taking an input example (x_input, y_target) and updating the input layer of the network.
For consistency, we treat the input as a node like any other, but without an activation function, so its output equals its input, i.e. y_1 = x_input.
Now, we update the first hidden layer. We take the outputs y_i of the nodes in the previous layer
and use the weights to compute the total input of each node j in the next layer: x_j = Σ_i w_ij y_i.
Then we update the output of the nodes in the first hidden layer.
For this we use the activation function: y_j = f(x_j).
Using these two formulas we propagate through the rest of the network and get the final output of the network.
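The forward pass can be sketched in Python. The weights and input below are made-up values for the 1-2-2-1 network in this article; weights[j][i] connects node i of the previous layer to node j of the current one:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_layer(prev_outputs, weights):
    """Apply the two forward-propagation formulas for one layer:
    x_j = sum_i w_ij * y_i, then y_j = f(x_j)."""
    outputs = []
    for w_row in weights:
        x = sum(w * y for w, y in zip(w_row, prev_outputs))
        outputs.append(sigmoid(x))
    return outputs

# Made-up weights for the 1-2-2-1 network.
W1 = [[0.5], [-0.3]]             # input -> first hidden layer
W2 = [[0.8, 0.2], [-0.4, 0.6]]   # first hidden -> second hidden
W3 = [[1.0, -1.0]]               # second hidden -> output

y = [0.7]                        # input node: y_1 = x_input
for W in (W1, W2, W3):
    y = forward_layer(y, W)
y_output = y[0]                  # final output of the network
```

Each call to forward_layer applies both formulas at once, so propagating through the whole network is just a loop over the layers.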
Error derivative
The backpropagation algorithm decides how much to
update each weight of the network after comparing the predicted output with the desired output for a particular example.
For this, we need to compute how the error changes
with respect to each weight, dE/dw_ij.
Once we have the error derivatives, we can update the weights using a simple update rule:
w_ij ← w_ij − α · dE/dw_ij,
where α is a positive constant, referred to as the learning rate, which we need to fine-tune empirically.
[Note] The update rule is very simple: if the error goes down when the weight increases (dE/dw_ij < 0),
then increase the weight; otherwise, if the error goes up when the weight increases (dE/dw_ij > 0),
then decrease the weight.
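The update rule itself is a single line of Python (update_weight is our own name):

```python
def update_weight(w, dE_dw, alpha=0.1):
    """Gradient-descent update: w <- w - alpha * dE/dw."""
    return w - alpha * dE_dw
```

A negative derivative increases the weight and a positive one decreases it, exactly as the note above describes.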
Additional derivatives
To help compute dE/dw_ij, we additionally store for each node two more derivatives:
how the error changes with:
- the total input of the node, dE/dx, and
- the output of the node, dE/dy.
Backpropagation
Let's begin backpropagating the error derivatives.
Since we have the predicted output of this particular input example, we can compute how the error changes with that output.
Given our error function E = (1/2)(y_output − y_target)^2 we have:
dE/dy_output = y_output − y_target.
Now that we have dE/dy we can get dE/dx using the chain rule:
dE/dx = (dy/dx) · (dE/dy) = f'(x) · dE/dy,
where f'(x) = f(x)(1 − f(x)) when
f(x) is the sigmoid activation function.
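A sketch of this chain-rule step for a sigmoid node (the function names are our own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dE_dx(x, dE_dy):
    """Chain rule at one node: dE/dx = dy/dx * dE/dy,
    where dy/dx = f'(x) = f(x) * (1 - f(x)) for the sigmoid."""
    fx = sigmoid(x)
    return fx * (1.0 - fx) * dE_dy
```

Note that f'(x) can be computed from the node's output alone, which is why sigmoid networks need only store each node's output during the forward pass.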
As soon as we have the error derivative with respect to the total input of a node,
we can get the error derivative with respect to the weights coming into that node:
dE/dw_ij = (dx_j/dw_ij) · (dE/dx_j) = y_i · dE/dx_j.
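In code this step is a single multiplication: since x_j = Σ_i w_ij y_i, the weight w_ij enters x_j only through the term w_ij · y_i, so dx_j/dw_ij = y_i (the function name is our own):

```python
def dE_dw(y_i, dE_dx_j):
    """dE/dw_ij = dx_j/dw_ij * dE/dx_j = y_i * dE/dx_j."""
    return y_i * dE_dx_j
```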
And using the chain rule, we can also get dE/dy for the previous layer: dE/dy_i = Σ_j w_ij · dE/dx_j. We have come full circle.
All that is left to do is repeat the previous three formulas until we have computed all the error derivatives.
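Putting the forward pass, the three backpropagation formulas, and the update rule together, one training step might look like the sketch below. The weights, input, target, and learning rate are made-up values, and the layer layout follows this article's 1-2-2-1 network:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(weights, x_input, y_target, alpha=0.5):
    """One forward and backward pass over a fully connected sigmoid net.

    weights[l][j][i] connects node i of layer l to node j of layer l+1.
    Returns the updated weights and the error before the update."""
    # Forward propagation, keeping every layer's outputs for later.
    ys = [[x_input]]                      # input node: y_1 = x_input
    for W in weights:
        ys.append([sigmoid(sum(w * y for w, y in zip(w_row, ys[-1])))
                   for w_row in W])
    y_output = ys[-1][0]
    error = 0.5 * (y_output - y_target) ** 2

    # Backpropagation: start from dE/dy at the output...
    dE_dy = [y_output - y_target]
    new_weights = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        # ...then dE/dx_j = f'(x_j) * dE/dy_j, using f'(x) = y * (1 - y).
        dE_dx = [y * (1.0 - y) * d for y, d in zip(ys[l + 1], dE_dy)]
        W = weights[l]
        # dE/dw_ij = y_i * dE/dx_j, followed by w <- w - alpha * dE/dw.
        new_weights[l] = [[w - alpha * ys[l][i] * dE_dx[j]
                           for i, w in enumerate(w_row)]
                          for j, w_row in enumerate(W)]
        # dE/dy_i = sum_j w_ij * dE/dx_j for the previous layer.
        dE_dy = [sum(W[j][i] * dE_dx[j] for j in range(len(W)))
                 for i in range(len(ys[l]))]
    return new_weights, error

# Made-up weights for the 1-2-2-1 network; input 0.7, target 1.0.
weights = [[[0.5], [-0.3]], [[0.8, 0.2], [-0.4, 0.6]], [[1.0, -1.0]]]
for _ in range(100):
    weights, error = train_step(weights, 0.7, 1.0)
```

Repeating the step drives the error down, which is the whole point of the algorithm: each pass computes all the error derivatives and nudges every weight against its gradient.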