Hyperparameters and normalization

Caution! This article is 3 years old. It may be obsolete or show old techniques. It may also still be relevant, and you may find it useful! So it has been marked as deprecated, just in case.


There are many hyperparameters that we can fine-tune when training a model. In previous posts we have played with the learning rate, the number, type and order of layers, the number of nodes, etc. Let's talk about regularization and batch size.


Regularization is a technique that helps reduce overfitting or reduce variance in our network by penalizing for complexity.

To implement regularization we add a term to our loss function that penalizes for large weights.

L2 regularization parameter

The L2 regularization term is $\sum_{j=1}^{n} \|w_j\|^2$, or the sum of the squared norms of the weight matrices, multiplied by a small constant $\frac{\lambda}{2m}$:

$$\text{loss} + \frac{\lambda}{2m} \sum_{j=1}^{n} \|w_j\|^2$$

where λ is the regularization parameter (a hyperparameter) and m is the number of inputs. If we make λ large, SGD will keep the weights small to minimize the loss, so some nodes will contribute very little to the output. Conceptually, this simplifies the model.
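
As a rough illustration of the penalty described above, the following plain-Python sketch adds the L2 term to a loss value. The weight values, the base loss and the helper name l2_penalty are made up for the example; they are not part of Keras.

```python
def l2_penalty(weight_matrices, lam, m):
    """Return (lam / (2 * m)) times the sum of all squared weights."""
    squared_norms = sum(
        w ** 2
        for matrix in weight_matrices
        for row in matrix
        for w in row
    )
    return lam / (2 * m) * squared_norms


# Two tiny, made-up weight matrices (one per layer).
weights = [
    [[1.0, -2.0], [0.5, 0.0]],
    [[3.0], [-1.0]],
]

base_loss = 0.25  # hypothetical loss before regularization
penalty = l2_penalty(weights, lam=0.01, m=10)
total_loss = base_loss + penalty

print(penalty, total_loss)
```

Doubling λ doubles the penalty, so the optimizer has a stronger incentive to shrink the weights.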

In Keras, we can pass the parameter kernel_regularizer to any layer. We can pass a λ of 0.01 as an argument to the L2 regularizer as shown below. The available regularizers are listed in the Keras documentation.

Dense(units=32, activation='relu', kernel_regularizer=regularizers.l2(0.01)),

Batch size

The larger the batch size, the quicker our model will complete each epoch during training. This is because our machine may be able to process much more than one single sample in parallel.

However, if we set the batch size too big, the quality of the model may degrade and it may be unable to generalize well to data it hasn't seen before.

  • mini-batch gradient descent: gradient descent algorithm where the gradient update will occur on a per-batch basis. Default in Keras.
  • batch gradient descent: implements gradient updates per epoch.
  • stochastic gradient descent: implements gradient updates per sample.
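
To make the difference between the three variants concrete, this small sketch counts how many gradient updates happen in one epoch for a given batch size. The dataset size of 1000 samples and the helper name updates_per_epoch are assumptions for the example.

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of gradient updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)


num_samples = 1000  # hypothetical dataset size

mini_batch = updates_per_epoch(num_samples, batch_size=32)           # 32 updates
full_batch = updates_per_epoch(num_samples, batch_size=num_samples)  # 1 update
stochastic = updates_per_epoch(num_samples, batch_size=1)            # 1000 updates

print(mini_batch, full_batch, stochastic)
```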


Each neuron can have its own bias term, which acts as a threshold that may affect whether the neuron gets activated or not. This bias is added to the weighted sum of outputs from the previous layer before passing it through the activation function. During training, both the weights and these biases will be updated.

node output = activation ( weighted sum of inputs + bias)
            = activation ( a1w1 + a2w2 + ... + anwn + b)
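
The formula above can be sketched in a few lines of plain Python. ReLU is used as the activation here only as an example, and the input and weight values are made up.

```python
def relu(x):
    """ReLU activation: negative values become 0."""
    return max(0.0, x)


def node_output(inputs, weights, bias, activation=relu):
    """activation(a1*w1 + a2*w2 + ... + an*wn + b)"""
    weighted_sum = sum(a * w for a, w in zip(inputs, weights))
    return activation(weighted_sum + bias)


inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]  # weighted sum is -0.75

without_bias = node_output(inputs, weights, bias=0.0)  # 0.0, the neuron stays inactive
with_bias = node_output(inputs, weights, bias=1.0)     # 0.25, the bias shifts the threshold

print(without_bias, with_bias)
```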

Learnable parameters

The number of learnable parameters for a densely connected layer in a neural network can be calculated as:

inputs * outputs + biases

We ignore the input layer in this calculation. A 3-layer network with 2, 3 and 2 nodes where the hidden and output layers have biases will have 17 learnable parameters (9 hidden + 8 output).
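
The count for the 2-3-2 network can be checked with a short helper. The function name dense_params is hypothetical.

```python
def dense_params(layer_sizes, use_bias=True):
    """Total learnable parameters of a stack of dense layers.

    layer_sizes lists the number of nodes per layer, input layer first.
    """
    total = 0
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        biases = outputs if use_bias else 0
        total += inputs * outputs + biases
    return total


print(dense_params([2, 3, 2]))  # 2*3 + 3 = 9 hidden, 3*2 + 2 = 8 output, 17 in total
```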

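For a convolutional layer, the inputs are the number of input channels (or the number of filters in the previous convolutional layer) and the outputs are the number of filters multiplied by the size of each filter, plus one bias per filter. A 3-layer network with 20x20 RGB images, a convolutional layer with two 3x3 filters, and 2 output nodes will have 1658 learnable parameters: 3 channels * 3x3 * 2 filters + 2 biases = 56 for the convolutional layer, and 20x20x2 * 2 + 2 = 1602 for the output layer (assuming zero padding, so the convolutional output stays 20x20).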
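
The same arithmetic for the convolutional example, as a sketch. It assumes zero padding so the convolutional output keeps the 20x20 size; the helper names are hypothetical.

```python
def conv_params(in_channels, kernel_h, kernel_w, filters):
    """Weights of every filter plus one bias per filter."""
    return in_channels * kernel_h * kernel_w * filters + filters


def dense_layer_params(inputs, outputs):
    """Weights plus one bias per output node."""
    return inputs * outputs + outputs


conv = conv_params(in_channels=3, kernel_h=3, kernel_w=3, filters=2)  # 56
flattened = 20 * 20 * 2  # conv output flattened, assuming the size stays 20x20
output = dense_layer_params(flattened, 2)  # 1602

total = conv + output
print(total)  # 1658
```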

Transfer learning

There are so many hyperparameters to adjust that it can be overwhelming to fine-tune them all. This is why we use transfer learning.

Transfer learning occurs when we use knowledge that was gained from solving one problem and apply it to a new but related problem. Fine-tuning takes a model that has already been trained for one given task and then tweaks the model to make it perform a second similar task.

Layers at the end of a model may have learned features that are very specific to the original task, whereas layers at the start of the model usually learn more general features like edges, shapes, and textures.

Because of this, we could remove the last layer of the pre-trained model, which was making predictions for a different problem, and add a new layer that we train to make predictions for our problem. Depending on how different our problem is, we may remove or add more layers.

Before training, we should freeze the original layers, which means their weights won't be updated. Only the weights in our new layers will be updated during training.
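
A minimal sketch of these steps, assuming the tensorflow.keras API. The small Sequential model below is only a stand-in for a real pre-trained model, and the layer sizes are made up; in practice the pre-trained model would be loaded with its trained weights.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for a pre-trained model (hypothetical): 8 features in, 10 classes out.
pretrained = keras.Sequential([
    keras.Input(shape=(8,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# Reuse every layer except the last one, and freeze each of them.
model = keras.Sequential()
model.add(keras.Input(shape=(8,)))
for layer in pretrained.layers[:-1]:
    layer.trainable = False  # weights will not be updated during training
    model.add(layer)

# New output layer for our own 2-class problem; only this layer will train.
model.add(layers.Dense(2, activation="softmax"))

model.compile(optimizer="sgd", loss="categorical_crossentropy")

print(len(model.trainable_weights))      # kernel and bias of the new layer
print(len(model.non_trainable_weights))  # kernels and biases of the frozen layers
```

Setting trainable before calling compile matters, because the model is compiled with whatever trainable state the layers have at that moment.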

Batch normalization

Before training a model, we want to normalize or standardize our data as part of pre-processing.

  • Normalization: Scaling data so it is all on the same scale (for example, scaling images to the same size or numbers to be between 0 and 1).
  • Standardization: Subtracting the mean from each value and dividing by the standard deviation so that the data has a mean of 0 and a standard deviation of 1.
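
Both operations can be sketched with the standard library. The data values and the helper names min_max_scale and standardize are made up for the example; in a real project a library such as scikit-learn would usually do this.

```python
import statistics


def min_max_scale(values):
    """Normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


data = [2.0, 4.0, 6.0, 8.0]

scaled = min_max_scale(data)      # smallest value becomes 0, largest becomes 1
standardized = standardize(data)  # mean of 0, standard deviation of 1

print(scaled)
print(standardized)
```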

If we don't normalize the data, we may get unstable gradients. We may also need more time to train the model.

We can also apply batch normalization to any of the layers in our model so that their outputs are normalized during training. It normalizes the output from the activation function, then multiplies it by one parameter (g) and adds another (b): (output * g) + b. These two new parameters are trainable, so they will be optimized like the weights and biases.

This process occurs on a per-batch basis.
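
As an illustration of the idea, here is a plain-Python sketch of batch normalization for a single feature across one batch. The small eps constant is an assumption added to avoid dividing by zero, and the values of g and b are made up; in a real layer they are learned.

```python
import math


def batch_norm(batch, g=1.0, b=0.0, eps=1e-5):
    """Normalize one feature across a batch, then scale by g and shift by b."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [g * (x - mean) / math.sqrt(variance + eps) + b for x in batch]


activations = [1.0, 2.0, 3.0, 4.0]  # one activation per sample in the batch

normalized = batch_norm(activations)             # mean close to 0
rescaled = batch_norm(activations, g=2.0, b=0.5)  # (output * g) + b

print(normalized)
print(rescaled)
```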

In Keras:
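
A minimal sketch of a batch normalization layer inside a model, assuming the tensorflow.keras API. The layer sizes are made up for illustration, and the initializer values shown are the usual defaults.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(16, activation="relu"),
    # Normalize the 16 outputs of the previous layer for each batch.
    layers.BatchNormalization(
        axis=1,
        beta_initializer="zeros",  # the added parameter starts at 0
        gamma_initializer="ones",  # the multiplying parameter starts at 1
    ),
    layers.Dense(2, activation="softmax"),
])

bn = model.layers[1]
print(len(bn.trainable_weights))  # gamma and beta
```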


Here axis=1 is the features axis, and we can also specify how the two trainable parameters are initialized with beta_initializer and gamma_initializer.