Decoder-only models, such as GPT, LLaMA, and Granite, predict the next token in a sequence based on preceding tokens.
Decoders operate autoregressively by predicting future tokens one at a time using previously generated tokens as context.
Masked self-attention ensures that decoders only attend to earlier tokens in the sequence during both training and inference.
Generative Pre-training (GPT) involves self-supervised learning where the model predicts the next token in a sequence.
Fine-tuning adapts pre-trained models for specific tasks (e.g., question answering or classification), often incorporating techniques like Reinforcement Learning from Human Feedback (RLHF).
Decoder models for prediction and inference tasks:
Regular (free-running) training: during training, the model uses its own prediction from the previous step as the input for the next step.
Teacher forcing training: instead of feeding the model's own predictions back as inputs for subsequent time steps, the actual previous token from the sequence is used.
Causal Attention Masking: A causal attention mask with negative-infinity values in its upper triangle is applied to the attention matrix. This mask ensures that each token can only attend to preceding tokens or itself, preventing future tokens from influencing the attention scores.
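A minimal sketch of such a mask, assuming PyTorch; the causal_mask helper is an illustrative name, not part of the course code:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Upper triangle filled with -inf, zeros elsewhere.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.randn(4, 4)                 # raw attention scores for a 4-token sequence
masked = scores + causal_mask(4)           # future positions become -inf
weights = torch.softmax(masked, dim=-1)    # softmax assigns them zero attention weight
```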
BERT (Bidirectional Encoder Representations from Transformers) is a revolutionary NLP model developed by Google.
It excels at understanding word context and semantics.
It's pre-trained using self-supervised learning on large text datasets.
It utilizes an encoder-only architecture from the Transformer model.
It's designed for language comprehension tasks, not text generation.
It can be fine-tuned for various tasks like text summarization, question answering, and sentiment analysis.
Unlike autoregressive models, BERT processes the entire input sequence simultaneously.
This allows it to capture bidirectional context, leading to a deeper understanding of word relationships.
Masked language modeling (MLM) involves randomly replacing input tokens with [MASK] and training BERT to predict the original masked words.
This helps BERT learn contextual representations.
The prediction process involves passing contextual embeddings through a layer to generate logits, and selecting the word with the highest logit value.
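A minimal sketch of that prediction step, assuming PyTorch; the layer name and dimensions are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim = 30522, 768             # illustrative BERT-base sizes
to_logits = nn.Linear(hidden_dim, vocab_size)   # maps contextual embeddings to vocabulary logits

contextual = torch.randn(1, 10, hidden_dim)     # contextual embeddings for 10 tokens
logits = to_logits(contextual)                  # shape: (1, 10, vocab_size)
predicted_ids = logits.argmax(dim=-1)           # word with the highest logit at each position
```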
The model has access to the whole input sequence, unlike decoder-only models.
BERT's bidirectional training allows it to understand context from both sides of a word.
This contrasts with autoregressive models like GPT, which only consider preceding text.
15% of input words are randomly masked during pre-training.
To mitigate pre-training/fine-tuning mismatch:
80% of masked words are replaced with the "[MASK]" token.
10% are replaced with a random token.
10% are left unchanged.
The model predicts the original masked words using cross-entropy loss.
'The' is masked ==> input = '[MASK]'; label = 'The'
'sun', 'set', 'behind', 'the' are unchanged ==> label = '[PAD]'
'distant' is replaced with a random token ==> input = random_token; label = 'distant'
'mountain', '.' are unchanged ==> label = '[PAD]'
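A minimal sketch of this 80/10/10 strategy in Python; mask_tokens and its signature are illustrative assumptions, not the course's exact implementation:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", pad_label="[PAD]"):
    # Select ~15% of tokens; the label is always the original token, '[PAD]' otherwise.
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < 0.15:
            labels.append(tok)
            r = random.random()
            if r < 0.8:
                inputs.append(mask_token)             # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))   # 10%: replace with a random token
            else:
                inputs.append(tok)                    # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(pad_label)                  # unselected tokens are ignored by the loss
    return inputs, labels

tokens = ["The", "sun", "set", "behind", "the", "distant", "mountain", "."]
inputs, labels = mask_tokens(tokens, vocab=tokens)
```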
BERT is also pre-trained using Next Sentence Prediction (NSP), where it learns to predict whether a sentence logically follows another.
The model is given pairs of sentences and must determine if the second sentence is a continuation of the first.
This helps BERT understand relationships between sentences.
BERT Embeddings
The input is tokenized, and special tokens are added:
[CLS] at the beginning of the sequence.
[SEP] to separate sentences.
Segment embeddings are used to distinguish which sentence each token belongs to.
Positional encodings provide information about the order of tokens.
A binary label indicates whether the second sentence is the next sentence or not (0: NotNext, 1: IsNext).
Zero padding is used to ensure all input sequences are the same length.
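A minimal sketch of how these inputs could be assembled, assuming plain Python; build_bert_input and max_len are illustrative:

```python
def build_bert_input(sent_a, sent_b, max_len=16):
    # Add [CLS]/[SEP], build segment IDs, and zero-pad to a fixed length.
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    segments = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)  # 0: first sentence, 1: second
    pad = max_len - len(tokens)
    return tokens + ["[PAD]"] * pad, segments + [0] * pad

tokens, segments = build_bert_input(["the", "sun", "set"], ["it", "grew", "dark"])
```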
NSP Task Processing
The encoder processes the input embeddings to generate contextual embeddings.
The [CLS] token's embedding (Emb1) is used for NSP classification.
A neural network is used to predict whether the second sentence follows the first.
The task is treated as a two class classification problem.
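A minimal sketch of that classifier, assuming PyTorch; a single linear layer stands in for the neural network, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

hidden_dim = 768
nsp_head = nn.Linear(hidden_dim, 2)             # two classes: 0 = NotNext, 1 = IsNext

contextual = torch.randn(1, 16, hidden_dim)     # encoder output for a padded sentence pair
cls_embedding = contextual[:, 0, :]             # Emb1: the [CLS] token's contextual embedding
nsp_logits = nsp_head(cls_embedding)            # shape: (1, 2)
is_next = nsp_logits.argmax(dim=-1)
```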
Training
The BERT model is trained by minimizing the combined loss from both NSP (z1) and MLM (z2).
This ensures accurate word prediction and sentence relationship understanding.
BERT labels are used for the MLM task and the IsNext label for NSP.
If the second sentence actually follows the first, IsNext = 1.
Otherwise, IsNext = 0.
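A minimal sketch of the combined objective, assuming PyTorch cross-entropy losses; the shapes, pad index, and random tensors are placeholders for real model outputs:

```python
import torch
import torch.nn as nn

vocab_size, pad_index = 30522, 0
mlm_criterion = nn.CrossEntropyLoss(ignore_index=pad_index)  # skip '[PAD]' labels (z2)
nsp_criterion = nn.CrossEntropyLoss()                        # IsNext/NotNext loss (z1)

mlm_logits = torch.randn(16, vocab_size)          # one row of logits per token position
mlm_labels = torch.randint(0, vocab_size, (16,))  # original tokens (pad_index where unmasked)
nsp_logits = torch.randn(1, 2)
nsp_label = torch.tensor([1])                     # IsNext = 1

loss = nsp_criterion(nsp_logits, nsp_label) + mlm_criterion(mlm_logits, mlm_labels)
```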
Fine-tuning
After pre-training, BERT can be fine-tuned for specific downstream tasks like sentiment analysis.
Fine-tuning involves training BERT on a task-specific dataset.
The [CLS] token's representation is used for classification tasks.
The contextual embeddings can also be reused for other applications, such as building vector databases.
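A minimal sketch of such a fine-tuning step for sentiment classification, assuming PyTorch; the classifier head, batch, and labels are illustrative, and the [CLS] embeddings would come from the pre-trained encoder:

```python
import torch
import torch.nn as nn

hidden_dim, num_classes = 768, 2
classifier = nn.Linear(hidden_dim, num_classes)                # task-specific head on top of BERT
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

cls_embeddings = torch.randn(8, hidden_dim)                    # batch of 8 [CLS] representations
labels = torch.randint(0, num_classes, (8,))                   # e.g. 0 = negative, 1 = positive
loss = criterion(classifier(cls_embeddings), labels)
loss.backward()
optimizer.step()
```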
Key components
[CLS] Token: Start of the sequence.
[SEP] Token: End of a sentence.
Segment Embeddings: Distinguish if a token belongs to the first or the second sentence.
Positional encodings: Give an idea of the order of tokens in the sequence.
Initialize Tokenizer: Begin by setting up the tokenizer using the get_tokenizer function.
Define Special Symbols: Specify special symbols (like [CLS], [SEP], [MASK], [PAD]) and assign unique index values to them.
Prepare for Masked Language Modeling (MLM):
Utilize the prepare_for_MLM function, which processes a list of tokens.
Inside this function:
Initialize necessary lists (e.g., for processed sentences, labels, raw tokens).
Apply BERT's MLM masking strategy to the tokens. This involves deciding which tokens to mask (using logic potentially encapsulated in a masking function).
Replace the selected tokens with the [MASK] token.
Create corresponding labels for each token, indicating whether it was masked or left as is.
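A simplified sketch of what a prepare_for_MLM-style function might do; the exact signature and internals of the course function may differ, and the full 80/10/10 split (sketched earlier) is collapsed to plain [MASK] replacement here for brevity:

```python
import random

def prepare_for_MLM(tokenized_sentences, mask_prob=0.15, mask_token="[MASK]", pad_label="[PAD]"):
    # Mask each sentence and build aligned labels plus the raw tokens.
    bert_inputs, bert_labels, raw_tokens_list = [], [], []
    for tokens in tokenized_sentences:
        inputs, labels = [], []
        for tok in tokens:
            if random.random() < mask_prob:
                inputs.append(mask_token)   # token selected for masking
                labels.append(tok)          # label keeps the original token
            else:
                inputs.append(tok)
                labels.append(pad_label)    # ignored by the loss
        bert_inputs.append(inputs)
        bert_labels.append(labels)
        raw_tokens_list.append(tokens)
    return bert_inputs, bert_labels, raw_tokens_list
```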
Prepare for Next Sentence Prediction (NSP):
Use the process_for_NSP function, providing it with tokenized sentences and their corresponding masked labels (from the MLM step).
Perform checks (e.g., ensure input lists have matching lengths and sufficient sentences).
Initialize lists to store:
Sentence pairs formatted for BERT input (e.g., [CLS] sentence A [SEP] sentence B [SEP]).
The masked labels aligned with these sentence pairs.
Binary labels (IsNext / NotNext) indicating if the second sentence in a pair naturally follows the first.
Generate the actual sentence pairs (some consecutive, some random) and their corresponding binary NSP labels.
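A simplified sketch of what a process_for_NSP-style function might do; the real course function may differ, and this version does not guard against accidentally sampling the true next sentence as the "random" partner:

```python
import random

def process_for_NSP(tokenized_sentences, masked_labels):
    # Build 50% consecutive (IsNext = 1) and 50% random (NotNext = 0) sentence pairs.
    assert len(tokenized_sentences) == len(masked_labels) and len(tokenized_sentences) >= 2
    bert_inputs, bert_labels, is_next_labels = [], [], []
    for i in range(len(tokenized_sentences) - 1):
        if random.random() < 0.5:
            j, is_next = i + 1, 1
        else:
            j, is_next = random.randrange(len(tokenized_sentences)), 0
        pair = ["[CLS]"] + tokenized_sentences[i] + ["[SEP]"] + tokenized_sentences[j] + ["[SEP]"]
        labels = ["[PAD]"] + masked_labels[i] + ["[PAD]"] + masked_labels[j] + ["[PAD]"]
        bert_inputs.append(pair)
        bert_labels.append(labels)
        is_next_labels.append(is_next)
    return bert_inputs, bert_labels, is_next_labels
```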
Prepare Final BERT Inputs:
Employ the prepare_BERT_final_inputs function.
Take the outputs from the previous steps (BERT input pairs, MLM labels, NSP labels) and organize them into the final lists ready to be fed into the BERT model for training.
Given a corpus, pairs of sentences are created as input.
Input is tokenized and numericalized.
Special tokens, CLS and SEP, are added.
The sequence is then zero-padded to a fixed length.
Next, the masking strategy is applied. According to this strategy, BERT labels are created in which every token has a label of zero except for the masked tokens, which keep the original token as their label.
Segment labels are also created, indicating which sentence each token belongs to.
Finally, the IsNext label shows whether the second sentence follows the first one.
Transformers process entire text sequences simultaneously, unlike sequential RNNs/LSTMs.
This significantly speeds up translation and improves context handling, especially for long texts.
This allows for a deeper and more coherent understanding of context.
The Transformer is a sequence-to-sequence model.
It consists of an encoder and a decoder.
The source text is tokenized, embedded, and positionally encoded.
The encoder's input goes through:
Embedding layer.
Positional encoding.
Multi-head attention.
Normalization layers.
Feedforward network.
The encoder outputs "memory" (contextual embeddings) containing the source sentence's information.
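A minimal sketch of this encoder path, assuming PyTorch's nn.TransformerEncoder; the sizes are illustrative and the positional-encoding module is omitted for brevity:

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000
embedding = nn.Embedding(vocab_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)  # attention + norm + feedforward
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

src = torch.randint(0, vocab_size, (12, 1))   # 12 source tokens, batch size 1
x = embedding(src) * math.sqrt(d_model)       # embedding layer (positional encoding would be added here)
memory = encoder(x)                           # contextual embeddings: the encoder's "memory"
```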
The decoder generates the translation one word at a time.
Key components:
Cross attention layer: attends to the encoder's memory, allowing the decoder to access the full source context.
Masking: ensures the decoder only considers preceding tokens for autoregressive generation.
Linear layer: converts contextual embeddings to logits for predicting the next token.
It starts with a beginning-of-sentence (BOS) token.
It uses the encoder's "memory."
The process repeats until an end-of-sentence (EOS) token is generated or a max length is reached.
Each generated token is embedded and positionally encoded.
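A minimal sketch of this generation loop, assuming PyTorch's nn.TransformerDecoder; the BOS/EOS ids, layer sizes, and the to_logits projection are illustrative, and positional encoding is again omitted:

```python
import torch
import torch.nn as nn

d_model, vocab_size, BOS, EOS, max_len = 512, 10000, 1, 2, 20
embedding = nn.Embedding(vocab_size, d_model)
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
to_logits = nn.Linear(d_model, vocab_size)        # converts contextual embeddings to logits

memory = torch.randn(12, 1, d_model)              # encoder "memory" for a 12-token source
tokens = [BOS]                                    # start with the beginning-of-sentence token
for _ in range(max_len):
    tgt = embedding(torch.tensor(tokens).unsqueeze(1))                       # (t, 1, d_model)
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(len(tokens))   # causal mask
    out = decoder(tgt, memory, tgt_mask=tgt_mask)
    next_token = to_logits(out[-1]).argmax(dim=-1).item()                    # highest-logit token
    tokens.append(next_token)
    if next_token == EOS:                         # stop at end-of-sentence
        break
```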
Cross-attention computes attention scores between decoder (target) positions and encoder (source) positions.
Helps the decoder focus on relevant parts of the input sequence.
Enables the model to capture long-range dependencies and align input/output sequences.
Masking