Overfitting / underfitting and supervised / unsupervised learning



Overfitting and Underfitting

Overfitting

Overfitting occurs when our model becomes very good at classifying or predicting on the data that was included in the training set, but noticeably worse at classifying data that it wasn't trained on. Essentially, the model has overfit the training set.

We can tell that the model is overfitting if, during training, the metrics reported for the training data are noticeably better than the ones reported for the validation data, or if the model performs poorly on test or real-world data.
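As a rough illustration, here is a minimal Keras sketch (using a made-up synthetic dataset and an intentionally oversized model) that compares the final training and validation loss from the History object; the dataset shape, layer sizes and epoch count are just example assumptions:

```python
import numpy as np
from tensorflow import keras

# Tiny synthetic dataset: 200 samples, 20 features, 3 classes (made-up numbers).
x_train = np.random.rand(200, 20)
y_train = keras.utils.to_categorical(np.random.randint(0, 3, size=200), num_classes=3)

# A deliberately large model for so little data, so it can memorize the training set.
model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Hold out 20% of the data as a validation set during training.
history = model.fit(x_train, y_train, validation_split=0.2, epochs=50, verbose=0)

# Training metrics much better than validation metrics -> a sign of overfitting.
print("final training loss:  ", history.history["loss"][-1])
print("final validation loss:", history.history["val_loss"][-1])
```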

Some ways to reduce it:

  • Adding more training data samples: this makes the dataset more diverse.
  • Data augmentation: Cropping, rotating, flipping or zooming the training samples to generate more data.
  • Reduce the complexity of the model: Remove layers and/or neurons.
  • Dropout: A regularization technique that randomly ignores a percentage of neurons in a given layer during training (see the sketch after this list).
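
Here's a minimal sketch of what data augmentation and dropout can look like in Keras, assuming a recent version where the RandomFlip/RandomRotation/RandomZoom preprocessing layers live in keras.layers; the 32x32 RGB input, layer sizes and 10 output classes are made-up example values:

```python
from tensorflow import keras

# Small image classifier combining data augmentation and Dropout regularization.
model = keras.Sequential([
    keras.layers.Input(shape=(32, 32, 3)),
    # Augmentation layers: random flips/rotations/zooms, only active during training.
    keras.layers.RandomFlip("horizontal"),
    keras.layers.RandomRotation(0.1),
    keras.layers.RandomZoom(0.1),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),  # randomly ignore 50% of this layer's outputs during training
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```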

Underfitting

A model is said to be underfitting when it's not even able to classify the data it was trained on, let alone data it hasn't seen before.

We can tell that a model is underfitting when the metrics given for the training data are poor, meaning that the training accuracy of the model is low and/or the training loss is high.

Some ways to reduce it:

  • Increase the complexity of the model: Add layers and/or neurons, or change what types of layers we're using and where.
  • Adding more features to training data samples
  • Reduce Dropout: If we are ignoring, say, 50% of the neurons in a layer, reduce that to 25%, for example (see the sketch after this list).
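
A minimal sketch of the first and last ideas, assuming a simple Keras classifier; the layer sizes, input shape and number of classes are made-up example values:

```python
from tensorflow import keras

# Before: a small model with aggressive Dropout, likely to underfit.
small_model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(3, activation="softmax"),
])

# After: more layers/neurons, and Dropout reduced from 50% to 25%.
bigger_model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(3, activation="softmax"),
])
```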

Supervised and unsupervised learning

Supervised learning

With supervised learning, each piece of data passed to the model during training is a pair that consists of the sample and the corresponding label.

The model will predict an output for an input sample, and then determine its error by looking at the difference between the value it predicted and the sample's actual label.
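
In code, this just means every training example is a (sample, label) pair. A minimal Keras sketch with synthetic placeholder data and an example architecture:

```python
import numpy as np
from tensorflow import keras

# Synthetic supervised data: each sample is paired with a corresponding label.
x_train = np.random.rand(100, 4)                  # samples
y_train = np.random.randint(0, 2, size=(100, 1))  # corresponding labels

model = keras.Sequential([
    keras.layers.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
# The loss compares the model's prediction with each sample's actual label.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, verbose=0)
```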

Unsupervised learning

With unsupervised learning, each piece of data passed to our model during training is solely an unlabelled input object, or sample. There is no corresponding label that's paired with the sample. The model will attempt to learn some type of structure from the data and extract its features.

Since the model never sees labels for the training data, there is no direct way to measure accuracy the way we can with supervised learning.

Examples:

  • Clustering algorithms: A clustering algorithm can analyze the data samples and start to learn their structure even though they're not labelled. Through learning the structure, it can start to cluster the data into groups (see the sketch after this list).
  • Autoencoders: Unsupervised learning is also used by autoencoders. This neural network takes an input, encodes it, and then outputs a decoded, reconstructed version of the original input. The goal is for the reconstructed sample to be as close as possible to the original sample. One application of this is denoising images.
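
As a quick clustering sketch, here's scikit-learn's KMeans run on made-up, unlabelled 2D samples; the three blobs and the choice of three clusters are just example assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unsupervised clustering: no labels are passed in, only the samples themselves.
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(samples)  # each sample is assigned to a cluster
print(cluster_ids[:10])
```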

Autoencoders are data-specific which makes them generally impractical for real-world data compression problems: you can only use them on data that is similar to what they were trained on, and making them more general thus requires lots of training data. But future advances might change this, who knows. The creator of Keras wrote a blog post about autoencoders.
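
In the spirit of that blog post, here is a minimal autoencoder sketch; the 784-dimensional input (e.g. a flattened 28x28 image) and the 32-dimensional bottleneck are just example values:

```python
from tensorflow import keras

# Encode the input into a small code, then decode it back and train the
# network to reconstruct its own input.
inputs = keras.Input(shape=(784,))
encoded = keras.layers.Dense(32, activation="relu")(inputs)       # encoder
decoded = keras.layers.Dense(784, activation="sigmoid")(encoded)  # decoder

autoencoder = keras.Model(inputs, decoded)
# Reconstruction loss: how close is the output to the original input?
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Training would look like: autoencoder.fit(x_train, x_train, ...)
# note that the input is used as its own target.
```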

Semi-supervised learning

Semi-supervised learning uses a combination of supervised and unsupervised learning techniques. It is used when we have a combination of both labelled and unlabelled data.

With semi-supervised learning, we first manually label a subset of our unlabelled data and train our model with it using supervised learning. Then we use the model to predict the labels of the remaining unlabelled data. This process is called pseudo-labeling. Finally, we retrain the model using the full dataset of labelled and pseudo-labelled data.
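
A minimal sketch of pseudo-labeling, using a scikit-learn classifier as a stand-in model and synthetic placeholder data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: a small labelled subset and a larger unlabelled pool.
rng = np.random.default_rng(0)
x_labelled = rng.normal(size=(100, 5))
y_labelled = (x_labelled[:, 0] > 0).astype(int)  # made-up labels
x_unlabelled = rng.normal(size=(400, 5))

# 1. Train on the small labelled subset (supervised step).
model = LogisticRegression()
model.fit(x_labelled, y_labelled)

# 2. Predict labels for the unlabelled data (pseudo-labels).
pseudo_labels = model.predict(x_unlabelled)

# 3. Retrain on the full dataset of labelled + pseudo-labelled data.
x_full = np.vstack([x_labelled, x_unlabelled])
y_full = np.concatenate([y_labelled, pseudo_labels])
model.fit(x_full, y_full)
```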

One-hot encoding

Class labels are usually encoded as integers or arrays of integers. One-hot encoding transforms our categorical labels into vectors of 0s and 1s. The length of these vectors is the number of classes or categories that our model is expected to classify.

Example:

  • Cat: [1, 0, 0]
  • Dog: [0, 1, 0]
  • Frog: [0, 0, 1]

Value interpretation:

  • 0: Cold
  • 1: Hot

With each one-hot encoded vector, every element will be a zero EXCEPT for the element that corresponds to the actual category of the given input, which will be the "hot" element.
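
As a quick sketch, Keras' to_categorical utility produces exactly these vectors from integer labels (the class names and the example labels below are just the cat/dog/frog example above):

```python
import numpy as np
from tensorflow import keras

# Integer labels for the classes cat (0), dog (1), frog (2).
class_names = ["cat", "dog", "frog"]
integer_labels = np.array([0, 1, 2, 1])  # cat, dog, frog, dog

one_hot = keras.utils.to_categorical(integer_labels, num_classes=len(class_names))
print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```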
