ImageNet training settings:
RMSProp optimizer with decay 0.9 and momentum 0.9; batch norm momentum 0.99; weight decay 1e-5.
Each model is trained for 350 epochs with a total batch size of 4096.
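As a rough sketch, these hyperparameters might be wired up in PyTorch as follows (a hypothetical mapping; the placeholder model and the choice of framework are assumptions, and the 4096 batch would in practice be split across accelerators):

    import torch

    model = torch.nn.Linear(8, 8)  # placeholder for the actual network

    # RMSProp with decay (alpha) 0.9, momentum 0.9, weight decay 1e-5.
    optimizer = torch.optim.RMSprop(
        model.parameters(),
        lr=0.256,           # peak learning rate; see the schedule below
        alpha=0.9,          # RMSProp decay
        momentum=0.9,
        weight_decay=1e-5,  # applied as an L2 term on the gradients
    )

    # Note: batch norm momentum 0.99 follows the TF convention; the PyTorch
    # equivalent is torch.nn.BatchNorm2d(..., momentum=0.01).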
Learning rate is first warmed up from 0 to 0.256, and then decayed by 0.97 every 2.4 epochs.
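Written out, the schedule might look like the sketch below; the warmup length is not stated here, so the 5-epoch value is a placeholder assumption, and the decay is shown as a staircase (it could equally be applied continuously):

    def learning_rate(epoch: float,
                      peak_lr: float = 0.256,
                      warmup_epochs: float = 5.0,  # assumed; not specified above
                      decay_rate: float = 0.97,
                      decay_every: float = 2.4) -> float:
        """Linear warmup from 0 to peak_lr, then decay by 0.97 every 2.4 epochs."""
        if epoch < warmup_epochs:
            return peak_lr * epoch / warmup_epochs
        return peak_lr * decay_rate ** ((epoch - warmup_epochs) // decay_every)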
An exponential moving average of the weights with 0.9999 decay rate is used, along with RandAugment, Mixup, Dropout, and stochastic depth with 0.8 survival probability.
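The weight EMA amounts to the update below after each optimizer step (a generic sketch; frameworks usually provide a built-in utility for this):

    import copy
    import torch

    def update_ema(ema_model: torch.nn.Module,
                   model: torch.nn.Module,
                   decay: float = 0.9999) -> None:
        """In-place update: ema <- decay * ema + (1 - decay) * current weights."""
        with torch.no_grad():
            for ema_p, p in zip(ema_model.parameters(), model.parameters()):
                ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

    model = torch.nn.Linear(8, 8)     # placeholder network
    ema_model = copy.deepcopy(model)  # initialized once from the model
    update_ema(ema_model, model)      # called after every optimizer step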
ImageNet21k training settings:
Training is shortened to 60 or 30 epochs to reduce training time, and cosine learning rate decay is used because it adapts to different numbers of training steps without extra tuning.
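The reason cosine decay needs no retuning is that its shape depends only on the fraction step / total_steps, so changing the number of epochs just changes total_steps (a minimal sketch):

    import math

    def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
        """Cosine decay from peak_lr down to 0 over total_steps."""
        return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))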
Because ImageNet21k images often carry multiple labels, the labels are normalized to sum to 1 before computing the softmax loss.
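Concretely, the normalization plus softmax cross-entropy could look like this hypothetical PyTorch sketch, where labels is a multi-hot vector per image:

    import torch
    import torch.nn.functional as F

    def normalized_softmax_loss(logits: torch.Tensor,
                                labels: torch.Tensor) -> torch.Tensor:
        """Scale multi-hot labels to sum to 1, then take softmax cross-entropy."""
        targets = labels / labels.sum(dim=-1, keepdim=True)
        log_probs = F.log_softmax(logits, dim=-1)
        return -(targets * log_probs).sum(dim=-1).mean()

    # Example: 2 images over 5 classes, each with multiple positive labels.
    logits = torch.randn(2, 5)
    labels = torch.tensor([[1., 0., 1., 0., 0.],
                           [0., 1., 1., 1., 0.]])
    loss = normalized_softmax_loss(logits, labels)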
Some models are trained on ImageNet (1,000 classes, about 1.28M images), while others are pre-trained on the larger ImageNet21k (21,841 classes, about 13M images).