import numpy as np
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Multi-head attention as proposed by Vaswani et al., 2017 ("Attention Is
    All You Need", https://arxiv.org/abs/1706.03762). Since query, key and value
    are passed in separately, the module covers both self-attention and
    cross-attention.
    """
    def __init__(self, d_model, n_heads, dropout_rate=0.1):
        super(MultiHeadAttention, self).__init__()
        assert d_model % n_heads == 0, "`d_model` should be a multiple of `n_heads`"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = self.d_v = d_model // n_heads  # head_dim
        self.dropout_rate = dropout_rate
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model)
        self.attention = ScaledDotProductAttention(np.sqrt(self.d_k), dropout_rate)
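        # The sqrt(d_k) passed above is the scaling factor from Vaswani et al.,
        # 2017: the QK^T scores are divided by sqrt(d_k) before the softmax to
        # keep the dot products from growing with the head dimension.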

    def split_heads(self, x):
        """ x: (batch_size, seq_len, d_model)
        """
        batch_size = x.size(0)
        x = x.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        # x: (batch_size, n_heads, seq_len, head_dim)
        return x

    def group_heads(self, x):
        """ x: (batch_size, n_heads, seq_len, head_dim)
        """
        batch_size = x.size(0)
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)
        # x: (batch_size, seq_len, d_model)
        return x
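
    # Note: group_heads inverts split_heads. For example, with d_model=512 and
    # n_heads=8, split_heads maps (batch, seq_len, 512) to (batch, 8, seq_len, 64),
    # and group_heads maps that back to (batch, seq_len, 512).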

    def forward(self, query, key, value, mask=None):
        """ query: (batch_size, query_len, d_model)
            key: (batch_size, key_len, d_model)
            value: (batch_size, value_len, d_model)
            mask: (batch_size, 1, source_seq_len) for source mask
                  (batch_size, target_seq_len, target_seq_len) for target mask
        """
        # apply the linear projections to query, key and value, then split into heads
        Q = self.split_heads(self.W_q(query))  # (batch_size, n_heads, query_len, head_dim)
        K = self.split_heads(self.W_k(key))    # (batch_size, n_heads, key_len, head_dim)
        V = self.split_heads(self.W_v(value))  # (batch_size, n_heads, value_len, head_dim)

        if mask is not None:
            # add a head dimension so the same mask broadcasts over all the heads
            mask = mask.unsqueeze(1)
            # mask: (batch_size, 1, 1, source_seq_len) for source mask
            #       (batch_size, 1, target_seq_len, target_seq_len) for target mask

        # calculate the attention weights and context vector for each of the heads
        x, attn = self.attention(Q, K, V, mask)
        # x: (batch_size, n_heads, query_len, head_dim)
        # attn: (batch_size, n_heads, query_len, value_len)

        # concatenate the context vectors of all the heads
        x = self.group_heads(x)  # (batch_size, query_len, d_model)

        # apply the output linear projection to the concatenated context vector
        x = self.W_o(x)  # (batch_size, query_len, d_model)

        # x: (batch_size, query_len, d_model)
        # attn: (batch_size, n_heads, query_len, value_len)
        return x, attn
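

# A minimal usage sketch (illustrative, not part of the module). It assumes the
# ScaledDotProductAttention class used above is already in scope and that a mask
# entry of 1/True means "attend" while 0/False means "masked out". The sizes below
# (batch_size=2, seq_len=5, d_model=512, n_heads=8) are arbitrary examples.
if __name__ == "__main__":
    batch_size, seq_len, d_model, n_heads = 2, 5, 512, 8
    mha = MultiHeadAttention(d_model, n_heads)

    x = torch.randn(batch_size, seq_len, d_model)

    # source (padding) mask, shape (batch_size, 1, source_seq_len); all ones here,
    # i.e. no positions are treated as padding
    src_mask = torch.ones(batch_size, 1, seq_len, dtype=torch.bool)

    # target (causal) mask, shape (batch_size, target_seq_len, target_seq_len)
    tgt_mask = torch.tril(torch.ones(seq_len, seq_len)).bool().expand(batch_size, -1, -1)

    out, attn = mha(x, x, x, mask=src_mask)      # encoder-style self-attention
    out_causal, _ = mha(x, x, x, mask=tgt_mask)  # decoder-style masked self-attention

    print(out.shape)   # expected: torch.Size([2, 5, 512])
    print(attn.shape)  # expected: torch.Size([2, 8, 5, 5])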