CS480/680 - Lecture 19:
Attention and Transformer Networks
[Vaswani et al., Attention is All You Need, NeurIPS, 2017]
1) Attention Definition:
Attention in Computer Vision: used to highlight important parts of an image that contribute to a desired output.
Attention in NLP: alignment in machine translation, and language modeling with Transformer networks.
2) Sequence Modeling:
Challenges with RNNs:
Long range dependencies
Gradient vanishing and explosion
Large # of training steps
Recurrence prevents parallel computation
Transformer Networks:
Facilitate long range dependencies
No gradient vanishing and explosion
Fewer training steps
No recurrence, which facilitates parallel computation
3) Retrieval:
Given a database of keys with corresponding values.
When a query is issued, it is compared (aligned) with the different keys.
The matching key retrieves its value as the output.
4) Attention Mechanism:
Mimics the retrieval of a value v_i for a query q based on a key k_i in a database.
Query: q
Key: k_i
Similarity: s_i = f(q, k_i), where f can be:
dot product: s_i = q^T k_i
scaled dot product: s_i = q^T k_i / sqrt(d), which keeps the similarity on a consistent scale as the dimensionality d grows
general dot product: s_i = q^T W k_i, which projects the query into the same space as the key
additive similarity: s_i = w^T tanh(W_q q + W_k k_i)
Weights: a_i = exp(s_i) / sum_j exp(s_j) (softmax over the similarities)
Value: v_i; the attention output is the weighted combination sum_i a_i v_i
Fig. Computing the similarity between the query q and each key k_i.
Fig. How the attention mechanism is computed. k_i: vector; s_i: scalar; a_i: scalar; v_i: vector.
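A minimal numpy sketch of this mechanism using the scaled dot-product similarity (the toy keys, values, and query below are made-up placeholders):

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(query, keys, values):
    d = query.shape[0]
    # Scaled dot-product similarity: s_i = q^T k_i / sqrt(d)
    scores = keys @ query / np.sqrt(d)
    # Weights: a_i = softmax(s_i), a soft "match" over the database keys.
    weights = softmax(scores)
    # Output: weighted combination of the values, sum_i a_i * v_i.
    return weights @ values

d = 8
keys = np.random.randn(4, d)    # toy database of 4 keys ...
values = np.random.randn(4, d)  # ... with corresponding values
query = np.random.randn(d)
print(attention(query, keys, values).shape)  # (8,)
```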
Example: Machine Translation
Query: s_{i-1} (hidden vector for the (i-1)th output word)
Key: h_j (hidden vector for the jth input word)
Value: h_j (hidden vector for the jth input word)
5) Transformer Network:
Vaswani et al., (2017) Attention is all you need.
Encoder-decoder based on attention (no recurrence)
5.1 Multihead Attention (in Encoder):
Multihead attention: compute multiple attentions per query with different weights
Given pairs of keys and values (k, v) as the database and a query that is compared against each key, the keys with the greatest similarity receive the highest weights, and the output is the weighted combination of the corresponding values.
The "linear" blocks are projection functions (analogous to convolutional layers); we can use several of them per query (e.g., 3), just as we use different filters in a convolution. A simplified sketch follows below.
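A rough numpy sketch of this idea, assuming random matrices Wq, Wk, Wv, Wo as stand-ins for the learned linear projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(Q, K, V, num_heads, d_model, rng):
    # One linear projection per head for the queries, keys, and values,
    # followed by a final output projection. The random matrices below are
    # placeholders for weights that would normally be learned.
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        Wq = rng.standard_normal((d_model, d_k))
        Wk = rng.standard_normal((d_model, d_k))
        Wv = rng.standard_normal((d_model, d_k))
        # Scaled dot-product attention in the projected subspace.
        scores = (Q @ Wq) @ (K @ Wk).T / np.sqrt(d_k)
        heads.append(softmax(scores) @ (V @ Wv))
    Wo = rng.standard_normal((num_heads * d_k, d_model))
    # Concatenate all heads and project back to d_model dimensions.
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))  # 5 tokens, d_model = 16
out = multihead_attention(X, X, X, num_heads=4, d_model=16, rng=rng)
print(out.shape)  # (5, 16)
```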
5.2 Masked Multihead Attention (in Decoder):
Masked multi-head attention: multi-head where some values are masked (i.e., probabilities of masked values are nullified to prevent them from being selected).
When decoding, an output value should only depend on previously generated outputs, not on future outputs that have not been produced yet. Hence we mask the future outputs.
In other words, we remove the connections to the terms that have not been produced yet.
Masked attention: add a mask M that sets the entries for the terms we do not want to attend to (the future terms) to minus infinity, so their probabilities become zero after the softmax.
Note: the exponential of minus infinity is zero.
Question: why don't we add the term M outside the soft-max?
Answer: Because we want the softmax to produce a proper distribution that sums to 1. Adding a term outside the softmax could break that normalization and yield an invalid distribution.
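A minimal numpy sketch of this masking, assuming self-attention over the decoder's own outputs (names and shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask M: 0 on and below the diagonal, minus infinity above it, so that
    # each position can only attend to itself and to earlier positions.
    M = np.where(np.arange(n)[None, :] > np.arange(n)[:, None], -np.inf, 0.0)
    weights = softmax(scores + M)  # exp(-inf) = 0, each row still sums to 1
    return weights @ V

X = np.random.randn(4, 8)  # 4 output positions, dimension 8
print(masked_attention(X, X, X).shape)  # (4, 8)
```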
5.3 Layer normalization:
Add normalization (norm) layers to ensure that the output of each layer stays on a consistent scale regardless of how we set the weights.
Similar to batch normalization, but the difference is that the normalization is done across the units of a single layer (per example) rather than across the batch.
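A minimal sketch of this normalization step (the learned gain and bias of full layer normalization are omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize across the features of each position (the layer's units),
    # unlike batch normalization, which normalizes across the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(5, 16) * 10 + 3  # 5 positions, 16 features each
y = layer_norm(x)
print(y.mean(axis=-1).round(6))  # ~0 for every position
print(y.std(axis=-1).round(3))   # ~1 for every position
```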
5.4 Positional embedding:
The attention mechanism does not care where the words are in the sentences (position). However, the ordering of the words is important.
The positional embedding carries information about the position which allows us to distinguish each word so that the sentence still retains its ordering information.
Add a vector known as the positional encoding; this vector differs depending on the word's position.
pos: the word's position; i: the index of an entry in the encoding vector.
The position is an integer (scalar); it is embedded into a vector with multiple entries.
Each entry is computed using the functions below: the sine for even entries, the cosine for odd entries:
PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
This vector is later simply added to the input embedding vector.
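A small numpy sketch of this encoding (the max_len and d_model values are arbitrary illustrative choices):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))   -> even entries
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))   -> odd entries
    pos = np.arange(max_len)[:, None]          # positions 0 .. max_len-1
    i = np.arange(0, d_model, 2)[None, :]      # even entry indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# The encoding is simply added to the input word embeddings:
#   x = word_embedding + pe[:sequence_length]
print(pe.shape)  # (50, 16)
```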
6) Comparison:
Attention reduces sequential operations and maximum path length, which facilitates long range dependencies.
A layer contains n positions; the embedding vector computed for each position has dimensionality d.
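For reference, the per-layer comparison reported in Vaswani et al. (2017), where n is the number of positions (sequence length), d the embedding dimensionality, and k the kernel size of a convolution:

Layer type      | Complexity per layer | Sequential operations | Maximum path length
Self-attention  | O(n^2 * d)           | O(1)                  | O(1)
Recurrent       | O(n * d^2)           | O(n)                  | O(n)
Convolutional   | O(k * n * d^2)       | O(1)                  | O(log_k(n))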
7) Results:
8) GPT and GPT-2:
GPT relies on the Transformer architecture.
GPT was released in 2018; GPT-2 followed in 2019.
9) BERT (Bidirectional Encoder Representations from Transformers):
Devlin et al., (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Encoder-based transformer that predicts a missing word based on the surrounding words by computing P(x_t | x_{1..t-1}, x_{t+1..T})
Mask missing word with masked multi-head attention.
Improved state of the art on 11 tasks