Datasets and predictions

Caution! This article is 3 years old, so it may be obsolete or show outdated techniques. It may also still be relevant and useful! It has been marked as deprecated, just in case.

Dataset types

The training dataset is separate from both the validation dataset and the test dataset.

Train dataset

This is the labelled data that we pass to the model over and over again so that it learns from it. The model's weights are updated based on the loss calculated from the training data. The goal is that we can eventually deploy our model to production and have it accurately predict on new data that it has never seen before.
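The idea that "weights are updated based on the loss" can be sketched with a toy, framework-free example: one gradient-descent step on a single weight of an assumed linear model y = w * x with squared-error loss (the names and numbers here are illustrative, not from any real dataset):

```python
import numpy as np

# Toy training data for an assumed linear relationship y = 2 * x
x = np.array([1.0, 2.0, 3.0])
y_true = np.array([2.0, 4.0, 6.0])

w = 0.0   # single model weight, starting untrained
lr = 0.1  # learning rate

# One training step: predict, compute the loss gradient, update the weight
y_pred = w * x
grad = np.mean(2 * (y_pred - y_true) * x)  # d(mean squared error)/dw
w -= lr * grad

print(w)  # the weight has moved toward the true value of 2
```

Repeating this step over many epochs is, in miniature, what happens to every weight in the network during training.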

Validation dataset

This is a set of labelled data that the model never trains on. In each epoch, the outputs are classified and the loss is calculated on the training data, and then the same happens with the validation data. However, the model's weights are not updated based on the loss calculated from the validation data.

One of the major reasons we need a validation dataset is to ensure that our model is not overfitting to the data in the training set. If the results on the training data are really good, but the results on the validation data are lagging behind, then our model is overfitting. The validation dataset allows us to see how well the model is generalizing during training.
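In Keras, model.fit returns a History object whose history dict holds per-epoch metrics, so we can compare training and validation accuracy to spot this gap. A minimal sketch, using hypothetical metric values in place of a real training run:

```python
# Hypothetical contents of history.history after a 4-epoch model.fit run
history = {
    "accuracy":     [0.80, 0.92, 0.97, 0.99],
    "val_accuracy": [0.78, 0.84, 0.85, 0.84],
}

# Training accuracy keeps climbing while validation accuracy plateaus:
# a growing gap between the two is a classic sign of overfitting
final_gap = history["accuracy"][-1] - history["val_accuracy"][-1]
if final_gap > 0.05:
    print("Possible overfitting: gap =", round(final_gap, 2))
```

The threshold of 0.05 is an arbitrary illustration; in practice you would plot both curves and watch for the point where they diverge.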

Test dataset

The test data is a set of unlabelled data that the model has never seen, used to evaluate the model after training. It provides a final check that the model is generalizing well before we deploy it to production, where it will make predictions on unlabelled data it has never seen before.

Code

In Keras, we can either specify a percentage to split our train dataset into training and validation, or directly pass the validation set. The example below splits the train dataset into 20% validation data and 80% training data (note that Keras takes the validation samples from the end of the training data, before any shuffling):


model.fit(x=scaled_train_samples, y=train_labels, validation_split=0.20)

If we pass the validation dataset directly, it should be a tuple of a samples array and a labels array:


validation_dataset = (scaled_valid_samples, valid_labels)

model.fit(x=scaled_train_samples, y=train_labels, validation_data=validation_dataset)

If we pass a validation dataset to the fit function, the output will display not only loss and accuracy, but also validation loss and validation accuracy.

Predicting

Predictions are based on what the model learned during training. During the prediction phase, we pass the model unlabelled data: either samples from our test dataset or real-time data in production.

This process will also tell us what our model has or hasn't learned. If we trained the model only on large dogs and then pass it small dogs, it won't perform well. We need to make sure that our training and validation sets are representative of the actual data we want our model to predict on.

We can do predictions in Keras like this:


predictions = model.predict(x=scaled_test_samples, batch_size=10, verbose=0)

for prediction in predictions:
    print(prediction)

=> [
[ 0.7410683  0.2589317]
[ 0.14958295  0.85041702]
...
]

The output will show predictions for each class and test sample. In this case, we have two classes for every sample.
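Each row is a probability distribution over the classes, so to turn it into a single predicted class label we can take the index of the largest probability with numpy's argmax. A small sketch using the two example rows from the output above:

```python
import numpy as np

# Example model output: one probability per class, per test sample
predictions = np.array([
    [0.7410683, 0.2589317],
    [0.14958295, 0.85041702],
])

# argmax over the class axis picks the most likely class for each sample
predicted_classes = np.argmax(predictions, axis=-1)
print(predicted_classes)  # -> [0 1]
```

The first sample is assigned class 0 and the second class 1, matching the larger probability in each row.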
