In natural language processing (NLP) and sequential data analysis, the order of elements is crucial.
For example: "The cat sat on the mat."
Rearrange these words and you might end up with nonsense or a completely different meaning.
Parallel processing is part of what makes Transformers so efficient, but it also means they don't inherently understand the concept of order, which poses a significant challenge when dealing with sequential data.
To address this, Transformers employ a technique called positional encoding.
The key is to find a way to encode position information directly into the input embeddings.
Two widely used approaches are Absolute (sinusoidal) Positional Encoding and Rotary Position Embedding (RoPE).
Think of how binary numbers count: the lowest bit flips every step, the next bit every two steps, the next every four, and so on. This pattern of different frequencies is the key to positional encoding. But instead of using discrete bits as in a binary representation, we can use something smoother: sine and cosine waves.
The sinusoidal positional encoding method works as follows:
It generates a unique vector for each position in the sequence, using a combination of sine and cosine functions.
The encoding vector has the same dimensionality as the token embeddings, allowing them to be simply added together.
Different dimensions in the encoding vector correspond to sinusoids of different frequencies, creating a spectrum from high to low frequencies.
This approach allows the model to easily attend to relative positions, as the encoding for any fixed offset can be represented as a linear function of the encoding at a given position.
Here’s how it works:
For each position in the sequence, we generate a vector of numbers.
Each number in this vector is calculated using either a sine or cosine function.
We use different frequencies for different dimensions of the vector.
PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position, i is the dimension index, and d_model is the dimension of the embeddings.
The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. That is, each dimension of the positional encoding corresponds to a sinusoid.
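As a concrete illustration, here is a minimal NumPy sketch of this scheme; the function name and the toy sizes are chosen for illustration only, not taken from any particular library:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, np.newaxis]           # positions: (seq_len, 1)
    i = np.arange(0, d_model, 2)[np.newaxis, :]       # even dimension indices: (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)     # one frequency per dimension pair

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# The encodings share the embedding dimension, so they are simply added to token embeddings.
pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)
print(pe.shape)  # (128, 512)
```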
"We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos)". Paper mentioned.
For each sine-cosine pair corresponding to a frequency ω_k, there exists a linear transformation M that can shift the position by any fixed offset φ.
In mathematical terms:

M · [sin(ω_k·t), cos(ω_k·t)]^T = [sin(ω_k·(t + φ)), cos(ω_k·(t + φ))]^T
This equation tells us that we can represent the positional encoding of any position (t + φ) as a linear transformation M of the encoding at position t.
The key here is that this transformation M is independent of t, meaning it works the same way regardless of the absolute position we’re starting from.
Expanding the right side of the equation using the trigonometric addition formulas, we find that the transformation matrix M (which looks very similar to a rotation matrix) is:

M = [  cos(ω_k·φ)   sin(ω_k·φ) ]
    [ -sin(ω_k·φ)   cos(ω_k·φ) ]
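A quick numerical check of this property, for a single frequency ω_k and offset φ (the values below are arbitrary):

```python
import numpy as np

omega_k, t, phi = 0.3, 5.0, 2.0   # illustrative frequency, position, and offset

pe_t = np.array([np.sin(omega_k * t), np.cos(omega_k * t)])                           # encoding at t
pe_t_plus_phi = np.array([np.sin(omega_k * (t + phi)), np.cos(omega_k * (t + phi))])  # encoding at t + phi

# M depends only on the offset phi (and omega_k), never on the absolute position t.
M = np.array([[ np.cos(omega_k * phi), np.sin(omega_k * phi)],
              [-np.sin(omega_k * phi), np.cos(omega_k * phi)]])

print(np.allclose(M @ pe_t, pe_t_plus_phi))  # True
```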
Efficient Computation of Relative Positions: The model can compute the positional encoding for any relative position by applying a simple linear transformation to the encoding of the current position. This allows the attention mechanism to easily focus on relative distances between tokens.
Ease of Learning Position-Relative Patterns: While not providing true translation invariance, this property makes it easier for the model to learn patterns that depend on relative positions rather than absolute positions. For example, it can more easily learn that "not" typically modifies the meaning of nearby words, regardless of where this pattern appears in the sentence.
Uniqueness: Each position in the sequence gets a unique pattern of high and low frequency components. This ensures that the model can distinguish between different positions.
Relative Position Information: The sinusoidal functions have a useful property where the positional encoding for a position can be expressed as a linear function of the encoding at another position. This allows the model to more easily learn to attend to relative positions.
Bounded Range: The values of sine and cosine functions are always between -1 and 1. This ensures that the positional encodings don’t grow or shrink excessively for very long sequences, which helps maintain numerical stability.
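The uniqueness and bounded-range properties are easy to verify numerically on a toy encoding matrix (the sizes below are arbitrary):

```python
import numpy as np

seq_len, d_model = 512, 128                      # arbitrary toy sizes
pos = np.arange(seq_len)[:, None]
i = np.arange(0, d_model, 2)[None, :]
angles = pos / 10000.0 ** (i / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

# Uniqueness: every position gets a distinct encoding vector.
print(len(np.unique(pe.round(6), axis=0)) == seq_len)   # True
# Bounded range: sine and cosine keep all values in [-1, 1].
print(pe.min() >= -1.0 and pe.max() <= 1.0)             # True
```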
Su et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding”
Used in: Llama [2], Llama 2 [3], PaLM [4], CodeGen [5], and others.
Instead of adding a separate positional encoding vector, RoPE applies a rotation to the existing token embeddings.
The rotation angle is a function of both the token’s position in the sequence and the dimension of the embedding.
This rotation preserves the norm of the embeddings while encoding positional information.
When computing the attention matrix, we aim to encode two essential characteristics:
Token Similarity: Tokens with similar embeddings should have a higher attention score. For example, “cat” and “dog” might have a higher score as they appear in similar contexts.
Positional Proximity: Words that are closer together in the sequence should generally have a higher score, as they’re more likely to be related.
To figure out how much one word should "pay attention" to another, we use an attention score. For any pair of positions m and n, it is the dot product of the query q_m and the key k_n. This dot product encapsulates both token similarity and positional information.
In this dot product, q · k = ||q|| ||k|| cos(θ), two factors contribute:
Magnitude: The product of the magnitudes, ||q|| · ||k||, corresponds to the token embedding similarity.
Angle: The angles of q and k (θ_m and θ_n) contribute to the positional similarity.
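A tiny 2-D example makes this decomposition concrete (the magnitudes and angles below are made up):

```python
import numpy as np

def polar_vec(r, theta):
    """Build a 2-D vector from a magnitude r and an angle theta."""
    return r * np.array([np.cos(theta), np.sin(theta)])

q = polar_vec(2.0, 0.8)   # ||q|| = 2.0, angle theta_m = 0.8
k = polar_vec(1.5, 0.3)   # ||k|| = 1.5, angle theta_n = 0.3

# q . k = ||q|| * ||k|| * cos(theta_m - theta_n): magnitudes and angles contribute separately.
print(np.isclose(q @ k, 2.0 * 1.5 * np.cos(0.8 - 0.3)))  # True
```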
RoPE leverages this view in a clever way:
Instead of adding separate positional encodings, RoPE rotates the query and key vectors based on their position in the sequence.
The rotation preserves the magnitude (maintaining token similarity) while encoding positional information in the angle.
"This approach allows RoPE to seamlessly integrate both token and positional information into a single operation, making it more efficient and potentially more effective than traditional positional encoding methods."
Sinusoidal positional embedding supports attending by relative position, since the encoding at any fixed offset is a linear function of the encoding at a given position; RoPE goes a step further and encodes relative position directly into the attention score.