Weight initialization


Caution! This article is 3 years old. It may be obsolete or show old techniques. It may also still be relevant, and you may find it useful! So it has been marked as deprecated, just in case.

Vanishing and exploding gradients

The gradient of the loss with respect to any weight is a product of derivatives that depend on components from later in the network. The earlier a weight sits in the network, the more terms this product contains.

Once SGD calculates the gradient with respect to a weight, the weight gets updated like this:

new weight = old weight - (learning rate * gradient)
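The update rule above can be sketched in a couple of lines of Python (the numbers are illustrative, not from the article):

```python
# Plain SGD update for a single weight (illustrative values).
learning_rate = 0.01
old_weight = 0.5
gradient = 2.0  # dLoss/dWeight, as computed by backpropagation

new_weight = old_weight - learning_rate * gradient
print(new_weight)  # 0.48
```
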

A vanishing gradient happens when many of the terms in the gradient product are small (less than 1), so the product shrinks toward zero. The weight then barely changes with each update, gets stuck around its current value, and the network stops learning.

An exploding gradient happens when many of the terms in the gradient product are large (greater than 1), so the product blows up. Each update is then so big that the weight is pushed further and further away from its optimal value.

Vanishing and exploding gradients are really two sides of a more general problem: unstable gradients.
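A toy calculation makes the instability concrete: if the gradient for an early weight is a product of one term per layer, and those terms are consistently below or above 1, the product vanishes or explodes as the network gets deeper.

```python
# One factor per layer; 30 layers deep.
depth = 30

vanishing = 0.5 ** depth   # every term < 1
exploding = 1.5 ** depth   # every term > 1

print(vanishing)  # ~9.3e-10: the update is effectively zero
print(exploding)  # ~1.9e+05: the update overshoots wildly
```
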

Weight initialization plays a role in how well and how quickly we can train our networks. When we simply draw the weights from a standard normal distribution, the inputs to our neurons have a large variance, and this causes unstable gradients.

Random weight initialization

When we build a model, the values for the weights will be initialized with random numbers that are normally distributed with mean 0 and standard deviation 1.

However, when we calculate the input to each node in the next layer, the weighted sum of the outputs from the previous layer will have a standard deviation much bigger than 1 (it grows with the square root of the number of inputs). So the weighted sum is likely to take values significantly larger or smaller than 1.

After passing through the activation function, those nodes end up saturated, their gradients are close to zero, and SGD makes only very small updates.
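We can verify the growth of the standard deviation with a quick NumPy sketch (the layer size of 512 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512  # fan-in: number of inputs feeding one neuron

# Inputs and weights both drawn from N(0, 1).
x = rng.standard_normal((10000, n))  # 10000 sample inputs
w = rng.standard_normal(n)           # one neuron's weights

z = x @ w  # pre-activation (weighted sum) for each sample
print(z.std())  # close to sqrt(512) ~ 22.6, far from 1
```

With tanh or sigmoid activations, values this far from zero land on the flat tails of the function, which is exactly the saturation described above.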

Xavier / Glorot weight initialization

Rather than the distribution of these weights having a standard deviation of 1, they will have a smaller variance of 1 / n, where n is the number of inputs to the neuron (the fan-in of the layer). They will still have a mean of 0.

To achieve this we take the random weights from before and multiply them by √(1 / n). If the layer uses ReLU as its activation function, we should instead multiply by √(2 / n) (this ReLU variant is known as He initialization).

Initially it was defined with a variance of 2 / (n_in + n_out), i.e. multiplying by √(2 / (n_in + n_out)), where n_in is the number of connections entering the layer and n_out the number exiting it.
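A small NumPy check shows the effect of this scaling; the layer sizes here are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 256

# Scale N(0, 1) draws so the weight variance becomes 2 / (n_in + n_out).
w = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / (n_in + n_out))

x = rng.standard_normal((10000, n_in))
z = x @ w  # pre-activations of the whole layer
print(z.std())  # close to sqrt(2 * n_in / (n_in + n_out)) ~ 1.15
```

Compared with the unscaled case, the pre-activations now stay on the order of 1 instead of growing with the layer width, which keeps the activations out of their saturated regions.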

In Keras, we can use the option kernel_initializer='glorot_uniform', although it is already the default for Dense layers. The available initializers are listed in the Keras documentation.

from tensorflow.keras.layers import Dense

Dense(32, activation='relu', kernel_initializer='glorot_uniform')