MLP-Mixer: An all-MLP Architecture for Vision

This post presents MLP-Mixer, an architecture that can replace convolution and self-attention in computer vision tasks. MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference costs comparable to state-of-the-art models.

MLP stands for multi-layer perceptron. MLP-Mixer was introduced by Google Brain (the original ViT team).

Paper: https://arxiv.org/pdf/2105.01601.pdf

Code: https://github.com/google-research/vision_transformer

Motivation, Objectives and Related Works

Motivation

Although convolutions and attention are both sufficient for good performance, neither of them is necessary.

Objectives

MLP-Mixer
An architecture based exclusively on multi-layer perceptrons (MLPs).
Relies only on:
1. Matrix multiplication.
2. Data Layout (reshapes and transpositions)
3. Scalar Non-linearities.
Achieves competitive results on image classification benchmarks while running up to about 2.5x faster at inference than a comparable ViT (see Main Results).

Model

Per-patch Linear Embedding.
Nx(Mixer layers).
A Classifier Head.

Fig. The architecture of MLP-Mixer

Per-patch Linear Embeddings

Input: S non-overlapping patches, extracted from the original image.

=> If the original image has resolution (H, W) and each patch has resolution (P, P), the number of patches is S = HW/P^2.

All S patches are linearly projected (with the same projection matrix) to a desired hidden dimension C.

=> Equivalently, this step is a 2D convolution with C output channels, a PxP kernel, stride P, and no padding.

Output: a two-dimensional real-valued table X ∈ R^(SxC), i.e. a sequence of S embedded patch tokens that all have the same hidden dimension C.

Example:

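Original image size: 224x224x3; patch size: 16x16; hidden dimension: C = 768.
Output size: S = (224/16) x (224/16) = 14 x 14 = 196 patches, each projected to a 768-dimensional token, so X ∈ R^(196x768).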
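
The example above can be reproduced with a short NumPy sketch. It is only an illustration of the reshape-and-project idea, not the reference implementation: the function name `patch_embed`, the random image, and the randomly initialised projection matrix `proj` are all assumptions made for the demo, and the bias term is omitted.

```python
import numpy as np

def patch_embed(image, patch_size, proj):
    """Split an (H, W, C_in) image into non-overlapping patches and project each one.

    proj has shape (patch_size * patch_size * C_in, C) and is shared by all patches.
    """
    H, W, C_in = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image size must be divisible by patch size"
    # (H, W, C_in) -> (H/P, P, W/P, P, C_in) -> (H/P, W/P, P, P, C_in)
    patches = image.reshape(H // P, P, W // P, P, C_in).transpose(0, 2, 1, 3, 4)
    # Flatten every patch into one row: (S, P*P*C_in) with S = HW/P^2
    patches = patches.reshape(-1, P * P * C_in)
    return patches @ proj  # (S, C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))             # stand-in for a real image
proj = rng.standard_normal((16 * 16 * 3, 768)) * 0.02  # hypothetical projection weights
X = patch_embed(image, 16, proj)
print(X.shape)  # (196, 768)
```

Applying one shared matrix to every flattened patch is what makes this step equivalent to a convolution with kernel PxP and stride P.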

MLP Block

Contains two fully connected layers with a GELU nonlinearity between them.

The GELU activation function is GELU(x) = xΦ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution.
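
As a small worked example, GELU can be evaluated directly from this definition, using the identity Φ(x) = 0.5 (1 + erf(x / √2)). The sketch below uses only the Python standard library; the function name `gelu` is chosen here for illustration.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard Gaussian CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

for value in (-1.0, 0.0, 1.0, 3.0):
    print(value, round(gelu(value), 4))
```

For large positive inputs GELU approaches the identity, and for large negative inputs it approaches zero, so it behaves like a smooth version of ReLU.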

Mixer Layer

A Mixer Layer contains two types of MLP blocks:

The Token-Mixing MLP:
- Allows communication between different spatial locations (tokens); it operates on each channel independently and takes individual columns of the table as inputs.
- It acts on columns of X (i.e. it is applied to the transposed input table X^T), maps R^S → R^S, and is shared across all columns.

The Channel-Mixing MLP:
- Allows communication between different channels; it operates on each token independently and takes individual rows of the table as inputs.
- It acts on rows of X, maps R^C → R^C, and is shared across all rows.

=> These two types of layers are interleaved to enable interaction of both input dimensions => MIXER Layer

Fig. The architecture of a Mixer Layer composed of 2 MLP blocks.

Formula

σ is an element-wise nonlinearity (GELU [16]).
W denotes the weight matrices of the fully connected layers inside the MLP blocks.
D_S and D_C are tunable hidden widths in the token-mixing and channel-mixing MLPs, respectively.
Because D_S is selected independently of the number of input patches, the computational complexity of the network is linear in the number of input patches.
Because D_C is independent of the patch size, the overall complexity is linear in the number of pixels in the image.
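
Written out with the symbols defined above, the two steps of a Mixer layer, as given in the MLP-Mixer paper, are the following. The weight shapes assume column-vector inputs: W_1 ∈ R^(D_S x S), W_2 ∈ R^(S x D_S), W_3 ∈ R^(D_C x C), W_4 ∈ R^(C x D_C).

```latex
\begin{aligned}
U_{*,i} &= X_{*,i} + W_2\,\sigma\bigl(W_1\,\mathrm{LayerNorm}(X)_{*,i}\bigr), && i = 1,\dots,C,\\
Y_{j,*} &= U_{j,*} + W_4\,\sigma\bigl(W_3\,\mathrm{LayerNorm}(U)_{j,*}\bigr), && j = 1,\dots,S.
\end{aligned}
```

The first line is the token-mixing step (applied to each column i of the table), and the second line is the channel-mixing step (applied to each row j).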

Other Components

Skip-connections.
Layer Norm on channels.
Dropout.
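
Putting the pieces together, one Mixer layer can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the official code: biases, dropout, and the learnable LayerNorm scale and offset are omitted, the weights are stored for row-vector inputs (so they are the transposes of the usual column-vector convention), and the tiny sizes S, C, D_S, D_C are hypothetical values chosen to keep the demo fast.

```python
import numpy as np
from math import erf, sqrt

_erf = np.vectorize(erf)  # element-wise erf without extra dependencies

def gelu(x):
    return x * 0.5 * (1.0 + _erf(x / sqrt(2.0)))

def layer_norm(x, eps=1e-6):
    # Normalise over the channel (last) axis; scale and offset omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w_in, w_out):
    # Two fully connected layers with a GELU in between, applied to the last axis.
    return gelu(x @ w_in) @ w_out

def mixer_layer(X, W1, W2, W3, W4):
    # Token mixing: transpose so the MLP acts on columns of X, then transpose back.
    U = X + mlp(layer_norm(X).T, W1, W2).T
    # Channel mixing: the MLP acts on rows of U.
    Y = U + mlp(layer_norm(U), W3, W4)
    return Y

rng = np.random.default_rng(0)
S, C, D_S, D_C = 16, 8, 4, 32  # hypothetical toy sizes
X = rng.standard_normal((S, C))
W1 = rng.standard_normal((S, D_S)) * 0.1
W2 = rng.standard_normal((D_S, S)) * 0.1
W3 = rng.standard_normal((C, D_C)) * 0.1
W4 = rng.standard_normal((D_C, C)) * 0.1
Y = mixer_layer(X, W1, W2, W3, W4)
print(Y.shape)  # (16, 8)
```

The output has the same shape as the input, which is why Mixer layers can be stacked N times without any reshaping in between.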

Classifier head

A standard classification head including:
1. A global average pooling layer.
2. A linear classifier.
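
A minimal NumPy sketch of such a head is shown below. The names `classifier_head`, `W_cls` and `b_cls`, the 10-class output, and the random inputs are assumptions made for illustration only.

```python
import numpy as np

def classifier_head(Y, W_cls, b_cls):
    """Global average pooling over tokens, followed by a linear classifier."""
    pooled = Y.mean(axis=0)        # (S, C) -> (C,)
    return pooled @ W_cls + b_cls  # (num_classes,)

rng = np.random.default_rng(0)
Y = rng.standard_normal((196, 768))            # output of the last Mixer layer
W_cls = rng.standard_normal((768, 10)) * 0.01  # hypothetical 10-class problem
b_cls = np.zeros(10)
logits = classifier_head(Y, W_cls, b_cls)
print(logits.shape)  # (10,)
```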

Training Strategy

Pre-training details.
- Optimizer: Adam (β1 = 0.9, β2 = 0.999);
- Batch size 4,096;
- Weight decay;
- Gradient clipping.
- A linear learning rate warmup.
- Pre-train at resolution 224.
- The cropping technique of Szegedy et al. plus random horizontal flipping (for JFT-300M).
- Additional data augmentation and regularization (for ImageNet and ImageNet-21k): RandAugment, mixup, dropout, and stochastic depth.
Fine-tuning details.
- SGD with momentum.
- Batch size 512.
- Gradient clipping at global norm 1.
- Cosine learning rate schedule with a linear warmup.
- Fine-tuning is also applied at higher resolutions than the pre-training resolution.
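
Two of the fine-tuning ingredients listed above, the cosine schedule with linear warmup and gradient clipping at global norm 1, can be sketched in plain Python. The function names and all numeric values in the demo (base learning rate, warmup length, total steps, gradient values) are hypothetical and not taken from the paper.

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so that their joint L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / global_norm) if global_norm > 0 else 1.0
    return [[g * scale for g in grad] for grad in grads], global_norm

for step in (0, 5, 10, 55, 100):
    print(step, warmup_cosine_lr(step, base_lr=0.1, warmup_steps=10, total_steps=100))

clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm, clipped)
```

Clipping by the global norm rescales every gradient by the same factor, so the update direction is preserved and only its length is limited.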

Comparison: MLP-Mixer vs CNN vs Vision Transformers

Similarities

From the perspective of CNN:
1. Channel Mixing MLP is similar to 1x1 convolutions.
2. The token-mixing MLP is similar to a single-channel depth-wise convolution with a full receptive field and parameter sharing.
Positional invariance, a prominent feature of convolution: the same channel-mixing MLP (token-mixing MLP) is applied to every row (column) of X. Tying the parameters of the channel-mixing MLPs (within each layer) is a natural choice because it provides positional invariance.
Same input size in every layer: each layer in Mixer takes an input of the same size, which is similar to Transformers, or to deep RNNs in other domains.
Skip connections and regularisation are utilized in all models.
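
The first similarity can be checked numerically: a dense layer applied to every token of the (S, C) table gives the same result as a 1x1 convolution over the corresponding grid of tokens. The sketch below uses a deliberately naive loop-based `conv1x1` helper (a hypothetical name) and toy sizes chosen for illustration.

```python
import numpy as np

def conv1x1(feature_map, kernel):
    """Naive 1x1 convolution: feature_map is (H, W, C_in), kernel is (C_in, C_out)."""
    H, W, _ = feature_map.shape
    out = np.empty((H, W, kernel.shape[1]))
    for i in range(H):
        for j in range(W):
            out[i, j] = feature_map[i, j] @ kernel  # same weights at every position
    return out

rng = np.random.default_rng(0)
grid = rng.standard_normal((4, 4, 8))  # hypothetical 4x4 grid of tokens, C = 8
W = rng.standard_normal((8, 16))       # one dense layer of a channel-mixing MLP
tokens = grid.reshape(-1, 8)           # the same data viewed as an (S, C) table
same = np.allclose(conv1x1(grid, W).reshape(-1, 16), tokens @ W)
print(same)  # True
```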

Difference

Feature Fusion: Neural Networks do feature fusion (i) at a given spatial position; (ii) between different spatial positions; or both at once.
- In CNNs, NxN convolutions (for N>1) and pooling implement (ii); 1x1 convolutions perform (i); and larger kernels perform both (i) and (ii).
- In Vision Transformers and other attention-based architectures, self-attention layers perform both (i) and (ii), while the MLP blocks perform only (i).
- MLP-Mixer separates the two: the channel-mixing MLPs perform (i) per location, and the token-mixing MLPs perform (ii) across locations.
Parameter sharing: in separable convolutions, a different convolutional kernel is applied to each channel; the token-mixing MLPs in Mixer share the same kernel (of full receptive field) for all of the channels.
Input size: CNNs typically have a pyramidal structure, in which deeper layers take lower-resolution input but have more channels; every Mixer layer takes an input of the same size.
Complexity:
- Convolution is more complex than the plain matrix multiplication used in MLPs: it requires an additional costly reduction to matrix multiplication and/or a specialized implementation.
- The computational complexity of MLP-Mixer is linear in the number of input patches, unlike Vision Transformers, whose complexity is quadratic.
Position embeddings: Unlike ViTs, Mixer does not use position embeddings because the token-mixing MLPs are sensitive to the order of the input tokens, and therefore may learn to represent location.
Mixer uses a standard classification head with the global average pooling layer followed by a linear classifier.
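
The linear-versus-quadratic point can be made concrete with a rough operation count. The sketch below is a back-of-the-envelope estimate, not a measurement: it counts only the multiply-accumulates of the cross-token interaction (the two matrix products of a token-mixing MLP, versus the two S x S products in self-attention) and ignores projections, softmax, and channel mixing. The function names and the sizes used are assumptions for illustration.

```python
def token_mixing_macs(S, C, D_S):
    """Token-mixing MLP: (C x S)(S x D_S) followed by (C x D_S)(D_S x S)."""
    return 2 * S * D_S * C

def attention_mixing_macs(S, C):
    """Self-attention token interaction: Q K^T and (attention weights) V."""
    return 2 * S * S * C

C, D_S = 768, 384  # hypothetical widths, held fixed while S grows
small, large = 196, 784  # 4x more patches, e.g. doubling the image side length

mixer_ratio = token_mixing_macs(large, C, D_S) / token_mixing_macs(small, C, D_S)
attn_ratio = attention_mixing_macs(large, C) / attention_mixing_macs(small, C)
print(mixer_ratio)  # 4.0
print(attn_ratio)   # 16.0
```

With four times as many patches, the token-mixing cost grows by a factor of 4, while the attention interaction grows by a factor of 16.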

Experiment Results

Evaluation Quantities

Accuracy of the downstream tasks.
Total computational cost of pre-training.
Throughput at inference time.

Goal: Not to demonstrate state-of-the-art results, but to show that, remarkably, a simple MLP-based model is competitive with today’s best convolutional and attention-based models.

Dataset

Downstream tasks. ILSVRC2012 “ImageNet”, cleaned-up ReaL labels, CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, and Visual Task Adaptation Benchmark (VTAB-1k).
Pre-training data. Pre-train on ILSVRC2012 ImageNet and on ImageNet-21k, a superset of ILSVRC2012. At larger scale, pre-train on JFT-300M.

Metrics

Complexity: compute two metrics: (1) Total pre-training time on TPU-v3 accelerators, (2) Throughput in images/sec/core on TPU-v3.
Quality: focus on top-1 downstream accuracy after fine-tuning.
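
Both kinds of metric are simple to compute. The sketch below shows top-1 accuracy from a list of logits and a throughput figure in images/sec/core. The helper names and the numbers in the demo are invented for illustration and are not results from the paper.

```python
def top1_accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)

def throughput(num_images, seconds, num_cores):
    """Images processed per second per accelerator core."""
    return num_images / (seconds * num_cores)

logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]  # toy predictions for 3 images
labels = [1, 0, 0]
print(top1_accuracy(logits, labels))
print(throughput(num_images=1000, seconds=2.0, num_cores=8))  # 62.5
```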

Results

Main Results

When pre-trained on ImageNet-21k with additional regularization:
- Mixer achieves an overall strong performance (84.15% top-1 on ImageNet).
When trained on ImageNet from scratch (random initialization):
- Mixer-B/16 attains a reasonable score of 76.4% at resolution 224, but tends to overfit.
- This score is similar to a vanilla ResNet50, but behind state-of-the-art CNNs/hybrids for the ImageNet “from scratch” setting, e.g. BotNet attains 84.7% [42], and NFNet attains 86.5% [7].
When the size of the upstream dataset increases:
- Mixer’s performance improves significantly.
- Mixer-H/14 achieves 87.94% top-1 accuracy on ImageNet, which is 0.5% better than BiT-ResNet152x4 and only 0.5% lower than ViT-H/14.
- Mixer-H/14 runs 2.5 times faster than ViT-H/14 and almost twice as fast as BiT.

The Role of Model Scale

Scale the model in two independent ways:
1. Increasing the model size (number of layers, hidden dimension, MLP widths) when pre-training => affects both pre-training compute and test-time throughput.
2. Increasing the input image resolution when fine-tuning => only affects the throughput.

Results: (Table 3 and Figure 3)

When trained from scratch on ImageNet:
- Mixer-B/16 achieves a reasonable top-1 accuracy of 76.44% (3% behind the ViT-B/16 model).
- Mixer-B/16 and Mixer-L/16 overfit more than ViT-B/16 and ViT-L/16, respectively, judging by their training losses.

When pre-trained on JFT-300M and fine-tuned at resolution 224, Mixer-H/14 is only 0.3% behind ViT-H/14 on ImageNet whilst running 2.2 times faster.

=> As the pre-training dataset grows, Mixer’s performance steadily improves.

=> Figure 3 clearly demonstrates that although Mixer is slightly below the frontier on the lower end of model scales, it sits confidently on the frontier at the high end.

The Role of Pre-training Dataset Size

Purpose:

Pre-training on larger datasets significantly improves Mixer’s performance; this experiment studies that effect in more detail.
Pre-train Mixer-B/32, Mixer-L/32, and Mixer-L/16 models on random subsets of JFT-300M containing 3%, 10%, 30% and 100% of all the training examples for 233, 70, 23, and 7 epochs, respectively.
Linear 5-shot top-1 accuracy on ImageNet is used as a proxy for transfer quality.
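
The epoch counts are not arbitrary: multiplying each subset fraction by its number of epochs gives roughly the same amount of training data seen in every setting, about 7 passes over the full dataset. A quick check in Python, with variable names chosen here for illustration:

```python
fractions = [0.03, 0.10, 0.30, 1.00]  # share of JFT-300M used for pre-training
epochs = [233, 70, 23, 7]             # epochs run on each subset

# Work done, expressed in "full-dataset epochs" (fraction of data x passes).
equivalent_epochs = [f * e for f, e in zip(fractions, epochs)]
for f, e, eq in zip(fractions, epochs, equivalent_epochs):
    print(f"{f:.0%} of the data for {e} epochs ~ {eq:.2f} full-dataset epochs")
```

Because the training budget is held roughly constant, differences between the four settings reflect the amount of distinct data rather than the amount of compute.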

Results: (Figure 2 - right)

When pre-trained on the smallest subset of JFT-300M, all Mixer models strongly overfit.
As the dataset size increases, the performance of both Mixer-L/32 and Mixer-L/16 grows faster than that of BiT; Mixer-L/16 keeps improving, while the BiT model plateaus. The same conclusions hold for ViT, consistent with Dosovitskiy et al. [14].
The performance gap between Mixer-L/16 and ViT-L/16 shrinks with data scale.

Visualization

Purpose:

The first layers of CNNs tend to learn Gabor-like detectors that act on pixels in local regions of the image.
In contrast, Mixer allows for global information exchange in the token-mixing MLPs, which raises the question of whether it processes information in a similar fashion.

Results:

Figure 4 shows the weights in the first few token-mixing MLPs (which allow communication between different spatial locations) of Mixer trained on JFT-300M.
Some of the learned features operate on the entire image, while others operate on smaller regions.
The first token-mixing MLP contains many local interactions, while the second and third layers contain more mixing across larger regions.
Higher layers appear to have no clearly identifiable structure. Similar to CNNs, we observe that many of the low-level feature detectors appear in pairs with opposite phases [39].

References
