Loss function and learning rate


Caution! This article is 3 years old. It may be obsolete or show old techniques. It may also still be relevant, and you may find it useful! So it has been marked as deprecated, just in case.

Loss function

The loss function is what SGD is attempting to minimize by iteratively updating the weights in the network. It is basically the error of our prediction:

error = prediction - actual value

In each epoch, the error is calculated for every output and accumulated across all the individual outputs. For example, the Mean Squared Error (MSE) is calculated like this:

squared_error(input) = (output - label) * (output - label) = e_input^2

MSE = ( e1^2 + e2^2 + ... + eN^2 ) / N

  • If we passed our entire training set to the model at once (batch_size = size of the training set), then the loss would be calculated once, at the end of each epoch.
  • If we split our training set into batches, and passed the batches one at a time to our model, then the loss would be calculated on each batch.
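
To make the difference between the two cases concrete, here is a minimal sketch in plain Python. The `mse` helper and the numbers are made up for illustration; they are not part of Keras.

```python
def mse(outputs, labels):
    """Mean squared error over paired predictions and labels."""
    squared_errors = [(o - y) ** 2 for o, y in zip(outputs, labels)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical predictions and labels, made up for illustration.
outputs = [2.5, 0.0, 2.0, 8.0]
labels = [3.0, -0.5, 2.0, 7.0]

# Whole training set in one batch: a single loss value per epoch.
full_loss = mse(outputs, labels)

# Batches of 2 samples: one loss value per batch.
batch_size = 2
batch_losses = [
    mse(outputs[i:i + batch_size], labels[i:i + batch_size])
    for i in range(0, len(outputs), batch_size)
]

print(full_loss)     # 0.375
print(batch_losses)  # [0.25, 0.5]
```

With a single batch we get one loss value for the whole set, while with batches of two samples we get one loss value per batch.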

In Keras you can specify the loss function when you compile the model:

    from tensorflow.keras.optimizers import Adam
    model.compile(
        optimizer=Adam(learning_rate=0.0001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

or independently:

> model.loss = 'sparse_categorical_crossentropy'
> model.loss
=> 'sparse_categorical_crossentropy'
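
To give an idea of what a loss like this measures, here is a rough sketch in plain Python. It assumes the model outputs one probability per class and that labels are integer class indices. The function below is a hypothetical re-implementation for illustration only, and Keras's own version takes care of more details, such as numerical stability.

```python
import math

def sparse_categorical_crossentropy(probabilities, labels):
    """Average of -log(probability assigned to the true class)."""
    losses = [-math.log(p[y]) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)

# Hypothetical model outputs for 2 samples and 3 classes.
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
labels = [0, 2]  # integer class indices, not one-hot vectors

loss = sparse_categorical_crossentropy(probabilities, labels)
print(round(loss, 4))  # 0.2899
```

The more probability the model assigns to the correct class, the closer the loss gets to 0.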

The Keras documentation lists the available loss functions.

Learning rate

The objective during training is for SGD to minimize the loss between the actual output and the predicted output. This is done in steps, and we can think of the learning rate of our model as the step size, the size of the adjustments made to the weights.

After the loss is calculated for our inputs, the gradient of that loss is then calculated with respect to each of the weights in our model. Once we have the value of these gradients, they will get multiplied by the learning rate and subtracted from each of the old weights to get their updated value.

new weight = old weight - (learning rate * gradient)
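
As a minimal sketch of this update rule, we can apply it by hand to a single weight. The toy loss `(weight - 3)^2`, its `gradient` function and the starting values are assumptions chosen for illustration; a real network computes the gradients with backpropagation.

```python
def gradient(weight):
    """Gradient of the toy loss (weight - 3)^2 with respect to weight."""
    return 2 * (weight - 3)

learning_rate = 0.1
weight = 0.0  # arbitrary starting value

for step in range(3):
    # new weight = old weight - (learning rate * gradient)
    weight = weight - learning_rate * gradient(weight)
    print(step, weight)
```

Each step moves the weight a bit closer to 3, the value that minimizes this toy loss, and the steps get smaller as the gradient shrinks.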

Choosing a value for the learning rate requires some testing. It is another hyperparameter that we have to test and tune for each model before we know where to set it, but as mentioned earlier, a typical guideline is to set it somewhere between 0.0001 and 0.01.
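
The kind of testing involved can be sketched with a toy experiment: run the same training loop with several learning rates and compare the final loss. The `train` function and the toy loss `(weight - 3)^2` are hypothetical stand-ins for a real model, so the best value found here says nothing about a real network.

```python
def train(learning_rate, steps=100):
    """Run gradient descent on the toy loss (weight - 3)^2, starting at 0."""
    weight = 0.0
    for _ in range(steps):
        grad = 2 * (weight - 3)
        weight = weight - learning_rate * grad
    return (weight - 3) ** 2  # final loss

# Three values from the guideline range, plus one deliberately too large.
candidates = [0.0001, 0.001, 0.01, 1.1]
final_losses = {lr: train(lr) for lr in candidates}

for lr, loss in final_losses.items():
    print(f"learning_rate={lr}: final loss={loss:.4f}")
```

On this toy problem the smallest rates barely move the weight in 100 steps, while the deliberately large one overshoots more on every step and ends with a loss higher than where it started.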

[Figure: learning rate plot and learning rate finder plot. From a post by J. Jordan.]

Each optimizer in Keras has its own default learning rate value, but you can change it. You can specify the learning rate when you choose the optimizer:

    model.compile(
        optimizer=Adam(learning_rate=0.0001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

or directly on the optimizer, after compiling the model:

model.optimizer.learning_rate = 0.0001

=> 0.0001