The process of moving the data forward through the network is called forward propagation. Backpropagation is the tool that gradient descent uses to calculate the gradient of the loss function with respect to each weight.
The value of an output node is the weighted sum of the outputs of the previous layer, after passing through the activation function of its own layer:
node output = activation ( weighted sum of inputs ) = activation ( a1w1 + a2w2 + ... + aNwN )
Therefore, to change the value of an output node we have two options: update its weights, or change the activation outputs coming from the previous layer.
We can't change those activation outputs directly, because each one is itself computed from the previous layer's weights and inputs. We can, however, change them indirectly by stepping back one layer and updating that layer's weights.
We continue this process, layer by layer, until we reach the input layer.
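The node output formula above can be sketched in a few lines of Python. This is only an illustration: the sigmoid activation and all numeric values are assumptions chosen for the example, and the helper names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, used here only as an example activation (assumption)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, activation=sigmoid):
    """Return activation(a1*w1 + a2*w2 + ... + aN*wN) for a single node."""
    weighted_sum = sum(a * w for a, w in zip(inputs, weights))
    return activation(weighted_sum)

# Hypothetical outputs of the previous layer and the node's weights.
previous_outputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.1]
before = node_output(previous_outputs, weights)

# Changing one weight changes the node output while the inputs stay fixed.
adjusted_weights = [0.4, 0.3, 0.1]
after = node_output(previous_outputs, adjusted_weights)

print(before, after)
```

The second call shows the point made in the text: the inputs from the previous layer are untouched, yet the node output moves because one weight moved.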
The notation supposes that we have:
L layers, with index l in the range l = [0, L - 1]
n nodes in layer l, with index j in the range j = [0, n - 1]
m nodes in the previous layer l - 1, with index k in the range k = [0, m - 1]
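For any node j in the output layer L, if the activation output is $a_j^{L}$ and the actual output is $y_j$, then, taking the squared error as the loss function, the loss C of training sample 0 at that node is
$$C_{0,j} = \left(a_j^{L} - y_j\right)^2$$
where $y_j$ is constant. The total loss will be the sum of the loss of all nodes:
$$C_0 = \sum_{j=0}^{n-1} \left(a_j^{L} - y_j\right)^2$$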
This is what we have seen already when we covered the loss function and learning rate.
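As a small illustration of the per-node and total loss for one training sample, here is a sketch that assumes a squared-error loss. The function names and the numbers are hypothetical.

```python
def node_loss(activation_output, target):
    """Loss of one output node, assuming squared error: (a - y)^2."""
    return (activation_output - target) ** 2

def total_loss(activation_outputs, targets):
    """Total loss of one sample: the sum of the losses of all output nodes."""
    return sum(node_loss(a, y) for a, y in zip(activation_outputs, targets))

# Hypothetical activation outputs of the last layer and the actual outputs.
outputs = [0.8, 0.1, 0.3]
targets = [1.0, 0.0, 0.0]

print(total_loss(outputs, targets))
```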
Input and output
For any node j in any layer l, the input $z_j^{l}$ is the weighted sum of the activation outputs $a_k^{l-1}$ of all nodes in layer l - 1, where the weights are $w_{jk}^{l}$:
$$z_j^{l} = \sum_{k=0}^{m-1} w_{jk}^{l} \, a_k^{l-1}$$
The output of node j in layer l will be the activation function $g^{l}$ of layer l applied to this weighted sum:
$$a_j^{l} = g^{l}\left(z_j^{l}\right)$$
This is what we have seen already when we covered the introduction to artificial neural networks.
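The input and output of a whole layer can be sketched as below. The layout `weights[j][k]` (row j holds the weights into node j) and the sigmoid activation are assumptions made for this example, not something the text prescribes.

```python
import math

def sigmoid(z):
    """Example activation function (assumption)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(prev_activations, weights, activation=sigmoid):
    """Compute inputs z and outputs a for every node j of one layer.

    weights[j][k] is the weight from node k in layer l-1 to node j in layer l.
    """
    z = [sum(w_jk * a_k for w_jk, a_k in zip(row, prev_activations))
         for row in weights]
    a = [activation(z_j) for z_j in z]
    return z, a

# Hypothetical layer: 3 nodes in layer l-1 feeding 2 nodes in layer l.
prev_activations = [1.0, 0.5, -0.5]
weights = [[0.2, 0.4, 0.0],
           [0.0, 0.0, 0.0]]
z, a = layer_forward(prev_activations, weights)
print(z, a)
```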
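The total loss of the network for a single input (training sample 0) is a composition of functions: the loss depends on the activation outputs, which depend on the node inputs, which in turn depend on the weights:
$$C_0 = C_0\left(a_j^{L}\left(z_j^{L}\left(w_{jk}^{L}\right)\right)\right)$$
To differentiate a composition of functions, we use the chain rule:
$$\frac{\partial C_0}{\partial w_{jk}^{l}} = \frac{\partial C_0}{\partial a_j^{l}} \cdot \frac{\partial a_j^{l}}{\partial z_j^{l}} \cdot \frac{\partial z_j^{l}}{\partial w_{jk}^{l}}$$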
The first term is calculated differently for the output layer and the hidden layers. This is because the loss function is a direct function of the outputs in the last layer, but is an indirect function of the outputs in the hidden layers. So the derivative will be different.
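Let's check the values of the three terms for l = L, j = 1 and k = 2, taking the squared-error loss $C_0 = \sum_j \left(a_j^{L} - y_j\right)^2$:
$$\frac{\partial C_0}{\partial a_1^{L}} = 2\left(a_1^{L} - y_1\right), \qquad \frac{\partial a_1^{L}}{\partial z_1^{L}} = \left(g^{L}\right)'\left(z_1^{L}\right), \qquad \frac{\partial z_1^{L}}{\partial w_{12}^{L}} = a_2^{L-1}$$
Putting it all together, the partial derivative of the loss with respect to one weight, for a single training sample, with j = 1 and k = 2 in the output layer, is:
$$\frac{\partial C_0}{\partial w_{12}^{L}} = 2\left(a_1^{L} - y_1\right) \cdot \left(g^{L}\right)'\left(z_1^{L}\right) \cdot a_2^{L-1}$$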
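If we have a total of N training samples, the derivative of the loss for all of them with respect to one weight in the output layer is the average of the per-sample derivatives:
$$\frac{\partial C}{\partial w_{12}^{L}} = \frac{1}{N} \sum_{i=0}^{N-1} \frac{\partial C_i}{\partial w_{12}^{L}}$$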
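To make the output-layer derivative concrete, the sketch below multiplies the three chain-rule terms for j = 1 and k = 2, averages over the samples, and compares the result with a finite-difference estimate. It assumes a squared-error loss and a sigmoid activation; the weights, samples, and function names are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def output_weight_gradient(prev_activations, weights, targets, j, k):
    """dC_0/dw_jk in the output layer for one sample (squared error assumed)."""
    z_j = sum(w * a for w, a in zip(weights[j], prev_activations))
    a_j = sigmoid(z_j)
    dC_da = 2.0 * (a_j - targets[j])   # first term: loss w.r.t. activation output
    da_dz = sigmoid_prime(z_j)         # second term: activation output w.r.t. input
    dz_dw = prev_activations[k]        # third term: input w.r.t. the weight
    return dC_da * da_dz * dz_dw

def mean_gradient(samples, weights, j, k):
    """Average the per-sample derivatives over all N samples."""
    grads = [output_weight_gradient(a_prev, weights, y, j, k)
             for a_prev, y in samples]
    return sum(grads) / len(grads)

def mean_loss(samples, weights):
    """Average squared-error loss over the samples, used only for the check."""
    total = 0.0
    for a_prev, y in samples:
        for row, y_j in zip(weights, y):
            a_j = sigmoid(sum(w * a for w, a in zip(row, a_prev)))
            total += (a_j - y_j) ** 2
    return total / len(samples)

# Hypothetical output layer: 2 nodes, fed by 3 nodes of the previous layer.
weights = [[0.1, -0.2, 0.3],
           [0.4, 0.5, -0.6]]
# Each sample: (activation outputs of layer L-1, actual outputs y).
samples = [([0.5, 0.1, 0.9], [1.0, 0.0]),
           ([0.2, 0.7, 0.4], [0.0, 1.0])]

analytic = mean_gradient(samples, weights, j=1, k=2)

# Central finite difference on the same weight as a sanity check.
eps = 1e-6
plus = [row[:] for row in weights]
minus = [row[:] for row in weights]
plus[1][2] += eps
minus[1][2] -= eps
numeric = (mean_loss(samples, plus) - mean_loss(samples, minus)) / (2 * eps)

print(analytic, numeric)
```

The two printed values should agree to several decimal places, which is a cheap way to catch a mistake in a hand-derived gradient.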
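For the hidden layers we use the chain rule as before, but now the first term has to be calculated differently. For example, for l = L - 1, j = 2 and k = 2:
$$\frac{\partial C_0}{\partial w_{22}^{L-1}} = \left( \sum_{j=0}^{n-1} \frac{\partial C_0}{\partial a_j^{L}} \cdot \frac{\partial a_j^{L}}{\partial z_j^{L}} \cdot \frac{\partial z_j^{L}}{\partial a_2^{L-1}} \right) \cdot \frac{\partial a_2^{L-1}}{\partial z_2^{L-1}} \cdot \frac{\partial z_2^{L-1}}{\partial w_{22}^{L-1}}$$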
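In the first term, we need to sum over j because a change in $a_2^{L-1}$ affects every node j in layer L, and through each of them the loss.
Inside that sum, the first two factors are calculated as before for the output layer. The third factor is new: since $z_j^{L}$ is a weighted sum of the activation outputs of layer L - 1, its derivative is simply the weight attached to $a_2^{L-1}$:
$$\frac{\partial z_j^{L}}{\partial a_2^{L-1}} = w_{j2}^{L}$$
Putting it all together, with $2\left(a_j^{L} - y_j\right)$ as the derivative of the squared-error loss:
$$\frac{\partial C_0}{\partial w_{22}^{L-1}} = \left( \sum_{j=0}^{n-1} 2\left(a_j^{L} - y_j\right) \cdot \left(g^{L}\right)'\left(z_j^{L}\right) \cdot w_{j2}^{L} \right) \cdot \left(g^{L-1}\right)'\left(z_2^{L-1}\right) \cdot a_2^{L-2}$$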
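The hidden-layer case can be checked numerically in the same way. The sketch below builds a tiny hypothetical network (3 inputs, 3 hidden nodes, 2 output nodes), again assuming a squared-error loss and sigmoid activations, and computes the derivative for the hidden weight with j = 2 and k = 2. The index `i` runs over the output nodes, which is the sum the text writes over j.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, w_hidden, w_out):
    """Forward propagation through one hidden layer and the output layer."""
    z_h = [sum(w * a for w, a in zip(row, x)) for row in w_hidden]
    a_h = [sigmoid(z) for z in z_h]
    z_o = [sum(w * a for w, a in zip(row, a_h)) for row in w_out]
    a_o = [sigmoid(z) for z in z_o]
    return z_h, a_h, z_o, a_o

def loss(x, y, w_hidden, w_out):
    """Squared-error loss of one sample (assumed loss function)."""
    a_o = forward(x, w_hidden, w_out)[3]
    return sum((a - t) ** 2 for a, t in zip(a_o, y))

def hidden_weight_gradient(x, y, w_hidden, w_out, j, k):
    """dC_0/dw_jk for a weight in the hidden layer L-1."""
    z_h, a_h, z_o, a_o = forward(x, w_hidden, w_out)
    # First term: sum over all output nodes i, because the hidden
    # activation a_h[j] feeds every one of them.
    dC_da_hidden = sum(
        2.0 * (a_o[i] - y[i]) * sigmoid_prime(z_o[i]) * w_out[i][j]
        for i in range(len(a_o))
    )
    da_dz = sigmoid_prime(z_h[j])   # second term
    dz_dw = x[k]                    # third term: activation from layer L-2
    return dC_da_hidden * da_dz * dz_dw

# Hypothetical inputs, targets and weights.
x = [0.3, -0.8, 0.5]
y = [1.0, 0.0]
w_hidden = [[0.1, 0.2, -0.3],
            [-0.4, 0.5, 0.6],
            [0.7, -0.8, 0.9]]
w_out = [[0.2, -0.1, 0.4],
         [-0.5, 0.3, 0.6]]

analytic = hidden_weight_gradient(x, y, w_hidden, w_out, j=2, k=2)

# Central finite difference on w_hidden[2][2] as a sanity check.
eps = 1e-6
plus = [row[:] for row in w_hidden]
minus = [row[:] for row in w_hidden]
plus[2][2] += eps
minus[2][2] -= eps
numeric = (loss(x, y, plus, w_out) - loss(x, y, minus, w_out)) / (2 * eps)

print(analytic, numeric)
```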
These gradients will be used to update the weights as we saw when explaining training and learning:
new weight = old weight - (learning rate * gradient)
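The update rule applied to a whole weight matrix might look like the following sketch. The nested-list layout and the learning rate value are assumptions made for the example.

```python
def update_weights(weights, gradients, learning_rate):
    """Apply new weight = old weight - (learning rate * gradient) to each weight."""
    return [
        [w - learning_rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, gradients)
    ]

# Hypothetical weights and their gradients, one row per node.
weights = [[1.0, 2.0],
           [-0.5, 0.0]]
gradients = [[0.5, -1.0],
             [0.0, 2.0]]

new_weights = update_weights(weights, gradients, learning_rate=0.1)
print(new_weights)
```

A weight with a positive gradient is decreased and a weight with a negative gradient is increased, so each step moves against the gradient of the loss.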