[π-model] [Temporal Ensembling] Temporal Ensembling for Semi-Supervised Learning
Paper: https://arxiv.org/abs/1610.02242
Code:
1) Motivation, Objectives and Related Works:
Motivation:
Objectives:
In this paper, we present a simple and efficient method for training deep neural networks in a semi-supervised setting where only a small portion of training data is labeled.
We introduce self-ensembling, where we form a consensus prediction of the unknown labels using the outputs of the network-in-training on different epochs, and most importantly, under different regularization and input augmentation conditions.
This ensemble prediction can be expected to be a better predictor for the unknown labels than the output of the network at the most recent training epoch, and can thus be used as a target for training.
Using our method, we set new records for two standard semi-supervised learning benchmarks, reducing the (non-augmented) classification error rate from 18.44% to 7.05% in SVHN with 500 labels and from 18.63% to 16.55% in CIFAR-10 with 4000 labels, and further to 5.12% and 12.16% by enabling the standard augmentations. We additionally obtain a clear improvement in CIFAR-100 classification accuracy by using random images from the Tiny Images dataset as unlabeled extra inputs during training. Finally, we demonstrate good tolerance to incorrect labels.
Related Works:
Contribution:
2) Methodology:
π-model:
The key idea is to create two random augmentations of each image, for both labeled and unlabeled data. A model with dropout then predicts the label of each augmented view.
The squared difference between these two predictions is used as a consistency loss.
For labeled images, we also compute the cross-entropy loss.
The total loss is a weighted sum of these two terms. A time-dependent weight w(t) decides how much the consistency loss contributes to the overall loss; it is ramped up from zero early in training so the supervised loss dominates at first.
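The π-model loss above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the `model` callable, the Gaussian-noise `augment` helper standing in for the paper's dropout and input augmentation, and the `rampup` constants are assumptions for the sketch (the Gaussian ramp-up shape follows the paper's appendix).

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x):
    # Hypothetical stochastic augmentation: small Gaussian input noise
    # stands in for the paper's crops/flips and network dropout.
    return x + rng.normal(0.0, 0.1, size=x.shape)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rampup(t, T=80, w_max=1.0):
    # Gaussian ramp-up w(t) = w_max * exp(-5 (1 - t/T)^2), zero-ish at t=0.
    t = np.clip(t, 0, T)
    return w_max * np.exp(-5.0 * (1.0 - t / T) ** 2)

def pi_model_loss(model, x, y, w_t):
    """x: batch of inputs; y: integer labels, with -1 marking unlabeled."""
    z1 = softmax(model(augment(x)))  # first stochastic forward pass
    z2 = softmax(model(augment(x)))  # second, independently augmented pass
    consistency = np.mean((z1 - z2) ** 2)   # squared-difference term
    labeled = y >= 0
    if labeled.any():                        # cross-entropy on labeled rows only
        ce = -np.mean(np.log(z1[labeled, y[labeled]] + 1e-12))
    else:
        ce = 0.0
    return ce + w_t * consistency
```

Note that both views go through the same network in the same epoch, so the π-model needs two forward passes per input.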
Temporal Ensembling:
Temporal ensembling modifies the π-model by maintaining an exponential moving average (EMA) of past predictions.
The key idea is to use the EMA of each sample's past predictions as one view. To get the other view, the image is augmented as usual and a model with dropout predicts its label.
The squared difference between the current prediction and the EMA prediction is used as a consistency loss.
For labeled images, we also compute the cross-entropy loss.
The final loss is a weighted sum of these two terms, again with a weight w(t) controlling the contribution of the consistency loss. Because the EMA targets come from stored past predictions, only one forward pass per input is needed per epoch, making training roughly twice as fast as the π-model.
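The per-epoch target maintenance can be sketched as a small helper. This is an illustrative sketch, not the paper's code: the function name is hypothetical, though the momentum α = 0.6 and the startup bias correction by 1 − α^t follow the paper's description.

```python
import numpy as np

def temporal_ensembling_step(Z, z_pred, epoch, alpha=0.6):
    """Update accumulated ensemble predictions after one epoch.

    Z      : accumulated predictions, shape (N, C), initialized to zeros
    z_pred : this epoch's softmax predictions, shape (N, C)
    alpha  : EMA momentum (the paper uses alpha = 0.6)

    Returns the updated accumulator and the bias-corrected targets
    z_hat = Z / (1 - alpha**(epoch + 1)); the division removes the
    startup bias toward zero that a plain EMA would have.
    """
    Z = alpha * Z + (1.0 - alpha) * z_pred
    z_hat = Z / (1.0 - alpha ** (epoch + 1))
    return Z, z_hat
```

The returned `z_hat` targets are then plugged into the same squared-difference consistency term as in the π-model, replacing the second forward pass.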
3) Experimental Results:
Experimental Results:
Ablations:
References: