The process of moving the data forward through the network is called forward propagation. Backpropagation is the tool that gradient descent uses to calculate the gradient of the loss function with respect to each weight.
The value of an output node is the weighted sum of the outputs of the previous layer, after passing through the activation function of its own layer:
node output = activation ( weighted sum of inputs )
= activation ( a1w1 + a2w2 + ... + aNwN )
Therefore, to change the value of an output node, we can either update its weights or change the activation outputs coming from the previous layer.
We can't change those activation outputs directly, because they are themselves computed from the previous layer's weights and inputs. We can change them indirectly, though, by stepping back one layer and updating its weights.
We continue this process until we reach the input layer.
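As a minimal sketch of the node output formula above, the following computes activation(a1w1 + a2w2 + ... + aNwN) for a single node. The sigmoid activation, the function names and the numeric values are assumptions chosen only for illustration.

```python
import math


def sigmoid(x):
    """Logistic activation, used here only as an example activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(prev_activations, weights, activation=sigmoid):
    """Return activation(a1*w1 + a2*w2 + ... + aN*wN) for one node."""
    weighted_sum = sum(a * w for a, w in zip(prev_activations, weights))
    return activation(weighted_sum)


# Hypothetical values: three nodes in the previous layer feeding one node.
prev_activations = [0.5, 0.1, 0.9]
weights = [0.4, -0.6, 0.2]

output = node_output(prev_activations, weights)
print(output)
```

Changing any entry of `weights` changes `output` directly, while the entries of `prev_activations` can only be changed by going back one layer and adjusting the weights that produced them.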
Mathematical representation
The notation supposes that we have:
- $L$ layers, with index $l$ in the range $l = [0, L - 1]$
- $n$ nodes in a layer $l$, with index $j$ in the range $j = [0, n - 1]$
- $m$ nodes in the previous layer $l - 1$, with index $k$ in the range $k = [0, m - 1]$
Loss function
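For any node $j$ in the output layer $L$, if the activation output is $a_j^{(L)}$ and the actual output is $y_j$, then, assuming the squared error as the loss, the loss $C_{0_j}$ of training sample 0 at that node is:

$$C_{0_j} = \left(a_j^{(L)} - y_j\right)^2$$

where $y_j$ is constant. The total loss is the sum of the losses of all nodes in the output layer:

$$C_0 = \sum_{j=0}^{n-1} C_{0_j}$$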
This is what we have seen already when we covered the loss function and learning rate.
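A small sketch of the loss computation for one training sample may help here. It assumes a squared-error loss per output node, which is an assumption of this example rather than something this section states; the function names and values are hypothetical.

```python
def node_loss(activation, target):
    """Loss of one output node, assuming squared error (an assumption)."""
    return (activation - target) ** 2


def total_loss(activations, targets):
    """Total loss of one sample: the sum of the losses of all output nodes."""
    return sum(node_loss(a, y) for a, y in zip(activations, targets))


# Hypothetical output-layer activations and the corresponding actual outputs.
output_activations = [0.8, 0.3]
actual_outputs = [1.0, 0.0]

sample_loss = total_loss(output_activations, actual_outputs)
print(sample_loss)
```

The actual outputs are fixed by the training data, so the only way to lower `sample_loss` is to change the activations, and therefore the weights behind them.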
Input and output
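For any node $j$ in any layer $l$, the input $z_j^{(l)}$ is the weighted sum of the activation outputs $a_k^{(l-1)}$ of all nodes in layer $l - 1$, where the weights are $w_{jk}^{(l)}$:

$$z_j^{(l)} = \sum_{k=0}^{m-1} w_{jk}^{(l)} \, a_k^{(l-1)}$$

The output of node $j$ in layer $l$ is the activation function of layer $l$, $g^{(l)}$, applied to this weighted sum:

$$a_j^{(l)} = g^{(l)}\left(z_j^{(l)}\right)$$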
This is what we have seen already when we covered the introduction to artificial neural networks.
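The input and output of a whole layer can be sketched as follows. The layout `weights[j][k]` (weight from node k in layer l - 1 to node j in layer l), the sigmoid activation and the numbers are assumptions made for this illustration.

```python
import math


def sigmoid(z):
    """Example activation function g for the layer."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(weights, prev_activations, g=sigmoid):
    """Return (z, a) for one layer.

    weights[j][k] connects node k of the previous layer to node j of this layer.
    """
    # Input of each node j: weighted sum over all nodes k of the previous layer.
    z = [sum(w_jk * a_k for w_jk, a_k in zip(row, prev_activations))
         for row in weights]
    # Output of each node j: the activation function applied to its input.
    a = [g(z_j) for z_j in z]
    return z, a


# Hypothetical layer: m = 3 nodes in the previous layer, n = 2 nodes in this one.
weights = [[0.1, 0.2, 0.3],
           [0.4, 0.5, 0.6]]
prev_activations = [1.0, 0.0, -1.0]

z, a = layer_forward(weights, prev_activations)
print(z, a)
```

Calling `layer_forward` once per layer, feeding each layer's `a` into the next, is forward propagation through the network.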
Differentiation
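The total loss of the network for a single input 0 is a composition of functions: the loss depends on the activation outputs, which depend on the inputs of the nodes, which in turn depend on the weights:

$$C_0 = C_0\left(a^{(L)}\left(z^{(L)}\left(w^{(L)}\right)\right)\right)$$

To differentiate a composition of functions, we use the chain rule:

$$\frac{\partial C_0}{\partial w_{jk}^{(l)}} = \frac{\partial C_0}{\partial a_j^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{jk}^{(l)}}$$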
The first term is calculated differently for the output layer and the hidden layers. This is because the loss function is a direct function of the outputs in the last layer, but is an indirect function of the outputs in the hidden layers. So the derivative will be different.
Output layer
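Let's check the values of the three terms for $l = L$, $j = 1$ and $k = 2$. The first term is written here for a squared-error loss, $C_{0_j} = (a_j^{(L)} - y_j)^2$, where only the term for node $j = 1$ depends on $a_1^{(L)}$:

$$\frac{\partial C_0}{\partial a_1^{(L)}} = 2\left(a_1^{(L)} - y_1\right)$$

$$\frac{\partial a_1^{(L)}}{\partial z_1^{(L)}} = {g^{(L)}}'\left(z_1^{(L)}\right)$$

$$\frac{\partial z_1^{(L)}}{\partial w_{12}^{(L)}} = a_2^{(L-1)}$$

Putting it all together, the partial derivative of the loss with respect to one weight, for a single training sample 0 with $j = 1$ and $k = 2$ in the output layer $L$, is:

$$\frac{\partial C_0}{\partial w_{12}^{(L)}} = 2\left(a_1^{(L)} - y_1\right) \, {g^{(L)}}'\left(z_1^{(L)}\right) \, a_2^{(L-1)}$$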
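If we have a total of $N$ training samples, the derivative of the loss for all of them with respect to one weight in the output layer is the average of the per-sample derivatives:

$$\frac{\partial C}{\partial w_{12}^{(L)}} = \frac{1}{N} \sum_{i=0}^{N-1} \frac{\partial C_i}{\partial w_{12}^{(L)}}$$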
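The output-layer derivative can be checked numerically. The sketch below multiplies the three chain-rule terms for one weight and compares the result with a central finite difference. It assumes a squared-error loss and a sigmoid activation in the output layer; all names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid with respect to its input."""
    s = sigmoid(z)
    return s * (1.0 - s)


def loss(weights, prev_activations, targets):
    """Squared-error loss of one sample, summed over the output nodes."""
    total = 0.0
    for row, y in zip(weights, targets):
        z = sum(w * a for w, a in zip(row, prev_activations))
        total += (sigmoid(z) - y) ** 2
    return total


def output_weight_gradient(weights, prev_activations, targets, j, k):
    """Derivative of the loss with respect to weights[j][k] via the chain rule."""
    z_j = sum(w * a for w, a in zip(weights[j], prev_activations))
    dC_da = 2.0 * (sigmoid(z_j) - targets[j])  # first term (squared error)
    da_dz = sigmoid_prime(z_j)                 # second term
    dz_dw = prev_activations[k]                # third term
    return dC_da * da_dz * dz_dw


# Hypothetical output layer: 2 nodes fed by 3 nodes of the previous layer.
weights = [[0.1, -0.3, 0.8],
           [0.5, 0.4, -0.6]]
samples = [([0.2, 0.7, 0.5], [0.0, 1.0]),
           ([0.9, 0.1, 0.3], [1.0, 0.0])]
j, k = 1, 2

# Single sample 0: analytic gradient versus a central finite difference.
x0, y0 = samples[0]
analytic = output_weight_gradient(weights, x0, y0, j, k)
eps = 1e-6
w_plus = [row[:] for row in weights]
w_minus = [row[:] for row in weights]
w_plus[j][k] += eps
w_minus[j][k] -= eps
numeric = (loss(w_plus, x0, y0) - loss(w_minus, x0, y0)) / (2 * eps)

# All N samples: the average of the per-sample gradients.
average_gradient = sum(
    output_weight_gradient(weights, x, y, j, k) for x, y in samples
) / len(samples)

print(analytic, numeric, average_gradient)
```

The analytic and numeric values should agree to several decimal places, which is a common sanity check for a hand-derived gradient.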
Hidden layers
For the hidden layers we use the chain rule as before, but now the first term has to be calculated differently.
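For $l = L - 1$, $j = 2$ and $k = 2$:

$$\frac{\partial C_0}{\partial w_{22}^{(L-1)}} = \frac{\partial C_0}{\partial a_2^{(L-1)}} \cdot \frac{\partial a_2^{(L-1)}}{\partial z_2^{(L-1)}} \cdot \frac{\partial z_2^{(L-1)}}{\partial w_{22}^{(L-1)}}$$

In the first term, we need to sum over $j$, because a change in $a_2^{(L-1)}$ affects all nodes in layer $L$:

$$\frac{\partial C_0}{\partial a_2^{(L-1)}} = \sum_{j=0}^{n-1} \frac{\partial C_0}{\partial a_j^{(L)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \cdot \frac{\partial z_j^{(L)}}{\partial a_2^{(L-1)}}$$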
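The first two factors inside the sum are calculated as before for the output layer. The third is new: since $z_j^{(L)}$ is a weighted sum of the activation outputs of layer $L - 1$, its derivative with respect to $a_2^{(L-1)}$ is the corresponding weight:

$$\frac{\partial z_j^{(L)}}{\partial a_2^{(L-1)}} = w_{j2}^{(L)}$$

Putting it all together:

$$\frac{\partial C_0}{\partial w_{22}^{(L-1)}} = \left( \sum_{j=0}^{n-1} \frac{\partial C_0}{\partial a_j^{(L)}} \, {g^{(L)}}'\left(z_j^{(L)}\right) \, w_{j2}^{(L)} \right) {g^{(L-1)}}'\left(z_2^{(L-1)}\right) \, a_2^{(L-2)}$$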
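The hidden-layer derivative can be sketched and checked in the same way. The example below uses a tiny network with one hidden layer and one output layer, and computes the gradient for one hidden weight by summing over all output nodes. The squared-error loss, the sigmoid activations, the network shape and every name are assumptions of this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def weighted_sums(weights, activations):
    """Input z of every node in a layer; weights[j][k] feeds node j from node k."""
    return [sum(w * a for w, a in zip(row, activations)) for row in weights]


def forward(w_hidden, w_out, inputs):
    z_hidden = weighted_sums(w_hidden, inputs)
    a_hidden = [sigmoid(z) for z in z_hidden]
    z_out = weighted_sums(w_out, a_hidden)
    a_out = [sigmoid(z) for z in z_out]
    return z_hidden, a_hidden, z_out, a_out


def loss(w_hidden, w_out, inputs, targets):
    a_out = forward(w_hidden, w_out, inputs)[3]
    return sum((a - y) ** 2 for a, y in zip(a_out, targets))


def hidden_weight_gradient(w_hidden, w_out, inputs, targets, j, k):
    """Derivative of the loss with respect to w_hidden[j][k]."""
    z_hidden, a_hidden, z_out, a_out = forward(w_hidden, w_out, inputs)
    # First term: sum over every output node i that hidden node j feeds into.
    dC_da_hidden = sum(
        2.0 * (a_out[i] - targets[i]) * sigmoid_prime(z_out[i]) * w_out[i][j]
        for i in range(len(w_out))
    )
    # Second and third terms, exactly as in the output layer.
    return dC_da_hidden * sigmoid_prime(z_hidden[j]) * inputs[k]


# Hypothetical network: 3 inputs, 3 hidden nodes, 2 output nodes.
inputs = [0.3, -0.8, 0.6]
w_hidden = [[0.2, -0.5, 0.1],
            [0.7, 0.3, -0.4],
            [-0.6, 0.9, 0.5]]
w_out = [[0.4, -0.2, 0.8],
         [-0.7, 0.6, 0.3]]
targets = [1.0, 0.0]
j, k = 2, 2

analytic = hidden_weight_gradient(w_hidden, w_out, inputs, targets, j, k)

# Central finite difference on the same weight, as an independent check.
eps = 1e-6
w_plus = [row[:] for row in w_hidden]
w_minus = [row[:] for row in w_hidden]
w_plus[j][k] += eps
w_minus[j][k] -= eps
numeric = (loss(w_plus, w_out, inputs, targets)
           - loss(w_minus, w_out, inputs, targets)) / (2 * eps)

print(analytic, numeric)
```

The sum inside `hidden_weight_gradient` is the only structural difference from the output-layer case: a hidden activation influences the loss through every node of the next layer.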
These gradients will be used to update the weights as we saw when explaining training and learning:
new weight = old weight - (learning rate * gradient)
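The update rule above can be sketched for a whole weight matrix as follows. The function name, the learning rate and the values are hypothetical and serve only to show the rule applied to every weight.

```python
def update_weights(weights, gradients, learning_rate):
    """Apply new weight = old weight - (learning rate * gradient) to every weight."""
    return [
        [w - learning_rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, gradients)
    ]


# Hypothetical weights of one layer and the gradients computed for them.
weights = [[0.5, -0.3],
           [0.8, 0.1]]
gradients = [[0.2, -0.1],
             [0.0, 0.4]]

new_weights = update_weights(weights, gradients, learning_rate=0.1)
print(new_weights)
```

A weight with a positive gradient is decreased and a weight with a negative gradient is increased, so each step moves the weights in the direction that lowers the loss.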