[Mean Teacher]
{, }
Paper: https://arxiv.org/abs/1703.01780
Code:
1) Motivation, Objectives and Related Works:
Motivation:
The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks.
It maintains an exponential moving average of label predictions on each training example and penalizes predictions that are inconsistent with this target.
However, because the targets change only once per epoch, Temporal Ensembling becomes unwieldy when learning large datasets.
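For context, a minimal NumPy sketch of the once-per-epoch target update used by Temporal Ensembling; variable names and the call signature are illustrative assumptions, not the original implementation:

import numpy as np

# Z accumulates an EMA of every training example's predicted class distribution.
# It is refreshed only once per epoch; the bias-corrected version is used as the
# consistency target during the next epoch.
def update_ensemble_targets(Z, epoch_predictions, epoch, alpha=0.6):
    Z = alpha * Z + (1.0 - alpha) * epoch_predictions     # EMA over epochs
    targets = Z / (1.0 - alpha ** (epoch + 1))            # startup bias correction
    return Z, targets

Because Z must store one prediction vector per training example and is only updated after a full pass over the data, the targets lag behind the model and the bookkeeping grows with dataset size.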
Objectives:
Propose Mean Teacher, a method that averages model weights instead of label predictions.
As an additional benefit, Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling.
Without changing the network architecture, Mean Teacher achieves an error rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling trained with 1000 labels.
We also show that a good network architecture is crucial to performance.
Combining Mean Teacher and Residual Networks, we improve the state of the art on CIFAR-10 with 4000 labels from 10.55% to 6.28%, and on ImageNet 2012 with 10% of the labels from 35.24% to 9.11%.
Related Works:
Contribution:
The general approach is similar to Temporal Ensembling, but it uses an exponential moving average (EMA) of the model parameters instead of the label predictions.
The key idea is to have two models called “Student” and “Teacher”.
The student model is a regular model with dropout.
The teacher model has the same architecture as the student model but its weights are set using an exponential moving average of the weights of the student model.
For a labeled or unlabeled image, we create two random augmented versions of the image.
The student model is used to predict the label distribution for the first augmented image.
The teacher model is used to predict the label distribution for the second augmented image.
The mean squared difference between these two predictions is used as the consistency loss.
For labeled images, we also calculate the cross-entropy loss.
The final loss is a weighted sum of these two loss terms. A weight w(t) decides how much the consistency loss contributes to the overall loss (see the training-step sketch below).
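A minimal PyTorch-style sketch of one training step, assuming `student` and `teacher` are two instances of the same network and `augment` applies random input noise/augmentation; all names and hyperparameters are illustrative, not the authors' released implementation:

import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.999):
    # Teacher weights are an exponential moving average of student weights
    # (the paper uses decay values close to 1, e.g. 0.99-0.999).
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)

def train_step(student, teacher, optimizer, x, labels, labeled_mask, w_t, augment):
    # Two independent random augmentations of the same batch.
    x_student, x_teacher = augment(x), augment(x)

    student_logits = student(x_student)
    with torch.no_grad():                       # no gradient flows into the teacher
        teacher_logits = teacher(x_teacher)

    # Consistency loss: mean squared difference between the two predicted
    # class distributions, computed on all examples (labeled or not).
    consistency = F.mse_loss(student_logits.softmax(dim=1),
                             teacher_logits.softmax(dim=1))

    # Classification loss: cross-entropy on the labeled examples only.
    if labeled_mask.any():
        class_loss = F.cross_entropy(student_logits[labeled_mask], labels[labeled_mask])
    else:
        class_loss = torch.zeros((), device=x.device)

    loss = class_loss + w_t * consistency       # w(t) weights the consistency term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)                # teacher updated after every step
    return loss.item()

In the paper, w(t) is ramped up from zero over the first part of training so that early, unreliable teacher targets do not dominate the loss, and the teacher weights are averaged after every optimizer step rather than once per epoch, which removes Temporal Ensembling's large-dataset bottleneck.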
2) Methodology:
Method 1:
Method 2:
3) Experimental Results:
Experimental Results:
Ablations:
References: