[MAGNETO] Foundation Transformers
{Multi-modal, Sub-LayerNorm, Initialization}
Paper: https://arxiv.org/pdf/2210.06423.pdf
Code: https://github.com/microsoft/unilm
1) Motivation, Objectives and Related Works:
Motivation:
A big convergence of model architectures across language, vision, speech, and multimodal tasks is emerging. However, despite sharing the name "Transformer", these areas use different implementations to get the best performance:
Post-LayerNorm for BERT
Pre-LayerNorm for GPT and vision Transformers.
Foundation Transformer ==> an architecture that serves as a go-to choice for various tasks and modalities, with guaranteed training stability.
Objectives: MAGNETO
Propose Sub-LayerNorm (Sub-LN), which adds another LayerNorm inside each sublayer for good expressivity, together with an initialization strategy theoretically derived from DeepNet (Wang et al., 2022a) for stable scaling up.
Extensive experiments demonstrate that it outperforms the de facto Transformer variants and is more stable to train.
Post-LN: LN(Linear(Attention(Linear(x))) + x)
Pre-LN: Linear(Attention(Linear(LN(x)))) + x
Sub-LN: Linear(LN(Attention(Linear(LN(x))))) + x
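To make the three placements concrete, here is a minimal PyTorch sketch of a single-head attention sublayer under each variant; the class name AttentionSublayer and its structure are illustrative assumptions, not the official unilm implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSublayer(nn.Module):
    """Single-head attention sublayer illustrating Post-LN / Pre-LN / Sub-LN placements."""

    def __init__(self, d_model: int, variant: str = "sub-ln"):
        super().__init__()
        self.variant = variant
        self.qkv = nn.Linear(d_model, 3 * d_model)   # input ("Linear") projection
        self.out = nn.Linear(d_model, d_model)       # output ("Linear") projection
        self.ln1 = nn.LayerNorm(d_model)             # LN used by all variants
        self.ln2 = nn.LayerNorm(d_model)             # extra LN used only by Sub-LN

    def _attend(self, h):
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return F.softmax(scores, dim=-1) @ v

    def forward(self, x):
        if self.variant == "post-ln":
            # Post-LN: LN(Linear(Attention(Linear(x))) + x)
            return self.ln1(self.out(self._attend(x)) + x)
        if self.variant == "pre-ln":
            # Pre-LN: Linear(Attention(Linear(LN(x)))) + x
            return self.out(self._attend(self.ln1(x))) + x
        # Sub-LN: Linear(LN(Attention(Linear(LN(x))))) + x
        return self.out(self.ln2(self._attend(self.ln1(x)))) + x
```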
Related Works:
Contribution:
MAGNETO model
Sub-LayerNorm
2) Methodology:
Sub-LN:
There are two key improvements in terms of modeling.
Sub-LN introduces another LayerNorm inside each sublayer (i.e., the multi-head self-attention and the feed-forward network): one before the input projection and the other before the output projection.
Use an initialization theoretically derived from DeepNet (Wang et al., 2022a), which fundamentally improves training stability and allows the model to be scaled up to massive sizes without pain.
Only a few lines of code change on top of the vanilla Transformer architecture. Notably, following the derivation from DeepNet, the weights of the query and key projections are not scaled during initialization.
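As a rough illustration of the initialization idea, here is a hedged PyTorch sketch applied to a standard nn.TransformerEncoderLayer: Xavier initialization everywhere, with the value/output/FFN projections re-initialized with a gain gamma and the query/key projections left unscaled. The function name magneto_init_ and the choice to pass gamma as an argument are assumptions; the architecture-dependent value of gamma comes from the paper's derivation and is not reproduced here.

```python
import torch.nn as nn

def magneto_init_(layer: nn.TransformerEncoderLayer, gamma: float) -> None:
    """Illustrative sketch: Xavier-initialize every weight matrix, then re-initialize
    the value/output/FFN projections with gain `gamma`, leaving the query and key
    projections unscaled (as stated in the notes above). `gamma` is a placeholder
    for the architecture-dependent value derived in the paper."""
    for name, param in layer.named_parameters():
        if param.dim() < 2:
            continue  # skip biases and LayerNorm weights
        nn.init.xavier_normal_(param)
        if "in_proj_weight" in name:
            # in_proj_weight stacks [W_Q; W_K; W_V]; re-initialize only the V block
            # with gain gamma, leaving the Q and K blocks untouched.
            d = param.size(0) // 3
            nn.init.xavier_normal_(param.data[2 * d:], gain=gamma)
        elif "out_proj.weight" in name or "linear1.weight" in name or "linear2.weight" in name:
            nn.init.xavier_normal_(param, gain=gamma)
```

For example, magneto_init_(nn.TransformerEncoderLayer(d_model=512, nhead=8), gamma=1.0) would apply the scheme to one standard encoder layer.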
MAGNETO Architecture: Sub-LayerNorm
MAGNETO is built on Sub-LayerNorm (Sub-LN). It inherits the multi-head attention and the feed-forward network from the Transformer and introduces two layer-normalization modules inside each sublayer (except the cross-attention).
For the multi-head attention, the layer-normalization modules are placed before the QKV projection and before the output projection, which can be formulated as:
Q = WQ·LN(x), K = WK·LN(x), V = WV·LN(x)
MSA(x) = x + WO·LN(Attention(Q, K, V))
where WQ, WK, WV, and WO are the parameters of the multi-head self-attention.
For the feed-forward network, the layer-normalization modules are placed before the input projection and before the output projection, which can be written as:
FFN(x) = x + W2·LN(φ(W1·LN(x)))
where W1 and W2 are the parameters of the feed-forward layers, and φ is the non-linear activation function.
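Putting the two sublayers together, a minimal sketch of a MAGNETO-style encoder layer could look like the following; it reuses the AttentionSublayer sketch from above, and the names SubLNFeedForward and MagnetoEncoderLayer are illustrative, not taken from the unilm code.

```python
import torch.nn as nn

class SubLNFeedForward(nn.Module):
    """Feed-forward sublayer with Sub-LN: FFN(x) = x + W2·LN(phi(W1·LN(x)))."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.ln_in = nn.LayerNorm(d_model)   # LN before the input projection W1
        self.ln_out = nn.LayerNorm(d_ffn)    # LN before the output projection W2
        self.w1 = nn.Linear(d_model, d_ffn)
        self.w2 = nn.Linear(d_ffn, d_model)
        self.act = nn.GELU()                 # phi; the actual activation is a design choice

    def forward(self, x):
        return x + self.w2(self.ln_out(self.act(self.w1(self.ln_in(x)))))


class MagnetoEncoderLayer(nn.Module):
    """One encoder layer = Sub-LN attention sublayer followed by a Sub-LN FFN sublayer."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.attn = AttentionSublayer(d_model, variant="sub-ln")  # from the sketch above
        self.ffn = SubLNFeedForward(d_model, d_ffn)

    def forward(self, x):
        return self.ffn(self.attn(x))
```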
Initialization: Theoretical Derivation from DeepNet
Adopt the theoretical derivation from DeepNet (Wang et al., 2022a) to improve training stability. (DeepNet estimates the expected model update for Post-LN and introduces DeepNorm to bound the model update to a constant.)
MAGNETO first estimates the expected model update of Sub-LN and then demonstrates how to bound the model update with a proper initialization.
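For intuition, the quantity being bounded is the change of the model output after one optimization step; a minimal LaTeX sketch of this definition in the DeepNet style (notation assumed, not copied from the paper):

```latex
% Model update after one gradient step (DeepNet-style definition; notation assumed):
% the initialization is chosen so that this quantity stays bounded by a constant,
% independent of the model depth.
\Delta F \triangleq \left\lVert F(x;\,\theta^{*}) - F(x;\,\theta) \right\rVert,
\qquad \theta^{*} = \theta - \eta\,\frac{\partial \mathcal{L}}{\partial \theta}
```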
Expected Model Update for Pre-LN
The forward propagation for an N-layer Pre-LN Transformer with N attention sub-layers and N feedforward sub-layers can be formulated as:
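A standard way to write this formulation (notation assumed; the paper's exact indexing and final normalization may differ):

```latex
% Pre-LN forward propagation over the 2N sub-layers (attention and FFN interleaved);
% F_l denotes the l-th sub-layer (self-attention or feed-forward) and LN is LayerNorm.
x_{l+1} = x_{l} + F_{l}\left(\mathrm{LN}(x_{l})\right), \qquad l = 0, 1, \dots, 2N-1,
\qquad F(x;\,\theta) = \mathrm{LN}(x_{2N})
```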
3) Personal Ideas:
Method 1:
Method 2:
References:
Wang et al., 2022a. DeepNet: Scaling Transformers to 1,000 Layers. arXiv:2203.00555.