ImageNet training settings:
RMSProp optimizer with decay 0.9 and momentum 0.9; batch norm momentum 0.99; weight decay 1e-5.
Each model is trained for 350 epochs with a total batch size of 4096.
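As a rough sketch, these hyperparameters might be wired up in PyTorch as follows (a hypothetical mapping; the placeholder model and the choice of framework are assumptions, and the 4096 batch would in practice be split across accelerators):

    import torch

    model = torch.nn.Linear(8, 8)  # placeholder for the actual network

    # RMSProp with decay (alpha) 0.9, momentum 0.9, weight decay 1e-5.
    optimizer = torch.optim.RMSprop(
        model.parameters(),
        lr=0.256,           # peak learning rate; see the schedule below
        alpha=0.9,          # RMSProp decay
        momentum=0.9,
        weight_decay=1e-5,  # applied as an L2 term on the gradients
    )

    # Note: batch norm momentum 0.99 follows the TF convention; the PyTorch
    # equivalent is torch.nn.BatchNorm2d(..., momentum=0.01).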
Learning rate is first warmed up from 0 to 0.256, and then decayed by 0.97 every 2.4 epochs.
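Written out, the schedule might look like the sketch below; the warmup length is not stated here, so the 5-epoch value is a placeholder assumption, and the decay is shown as a staircase (it could equally be applied continuously):

    def learning_rate(epoch: float,
                      peak_lr: float = 0.256,
                      warmup_epochs: float = 5.0,  # assumed; not specified above
                      decay_rate: float = 0.97,
                      decay_every: float = 2.4) -> float:
        """Linear warmup from 0 to peak_lr, then decay by 0.97 every 2.4 epochs."""
        if epoch < warmup_epochs:
            return peak_lr * epoch / warmup_epochs
        return peak_lr * decay_rate ** ((epoch - warmup_epochs) // decay_every)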
An exponential moving average of the weights with 0.9999 decay rate is used, along with RandAugment, Mixup, Dropout, and stochastic depth with 0.8 survival probability.
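The weight EMA amounts to the update below after each optimizer step (a generic sketch; frameworks usually provide a built-in utility for this):

    import copy
    import torch

    def update_ema(ema_model: torch.nn.Module,
                   model: torch.nn.Module,
                   decay: float = 0.9999) -> None:
        """In-place update: ema <- decay * ema + (1 - decay) * current weights."""
        with torch.no_grad():
            for ema_p, p in zip(ema_model.parameters(), model.parameters()):
                ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

    model = torch.nn.Linear(8, 8)     # placeholder network
    ema_model = copy.deepcopy(model)  # initialized once from the model
    update_ema(ema_model, model)      # called after every optimizer step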
ImageNet21k training settings:
Training is shortened to 60 or 30 epochs to reduce training time, and cosine learning rate decay is used because it adapts to different numbers of training steps without extra tuning.
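The reason cosine decay needs no retuning is that its shape depends only on the fraction step / total_steps, so changing the number of epochs just changes total_steps (a minimal sketch):

    import math

    def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
        """Cosine decay from peak_lr down to 0 over total_steps."""
        return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))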
Because ImageNet21k images often carry multiple labels, the labels are normalized to sum to 1 before computing the softmax loss.
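Concretely, the normalization plus softmax cross-entropy could look like this hypothetical PyTorch sketch, where labels is a multi-hot vector per image:

    import torch
    import torch.nn.functional as F

    def normalized_softmax_loss(logits: torch.Tensor,
                                labels: torch.Tensor) -> torch.Tensor:
        """Scale multi-hot labels to sum to 1, then take softmax cross-entropy."""
        targets = labels / labels.sum(dim=-1, keepdim=True)
        log_probs = F.log_softmax(logits, dim=-1)
        return -(targets * log_probs).sum(dim=-1).mean()

    # Example: 2 images over 5 classes, each with multiple positive labels.
    logits = torch.randn(2, 5)
    labels = torch.tensor([[1., 0., 1., 0., 0.],
                           [0., 1., 1., 1., 0.]])
    loss = normalized_softmax_loss(logits, labels)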
Some models are trained on ImageNet (1,000 classes, about 1.28M images), while others are pre-trained on the larger ImageNet21k (21,841 classes, about 13M images).