The general strategy for achieving better performance is to increase the models’ sizes as well as the amount of data they are pre-trained on (DistilBERT being a notable exception).
NLP Transformers: models trained on large amounts of raw text in a self-supervised fashion.
Self-supervised Learning: a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data.
Causal Language Modeling: predicting the next word in a sentence having read the n previous words. The output depends on the past and present inputs, but not the future ones.
Masked Language Modeling: the model predicts a masked word in the sentence.
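A minimal sketch of the two objectives using Hugging Face transformers pipelines (the checkpoints gpt2 and distilbert-base-uncased are example choices, not prescribed by these notes):

    from transformers import pipeline

    # Causal LM: predict the next words given only the previous ones (GPT-style).
    generator = pipeline("text-generation", model="gpt2")
    print(generator("The Transformer architecture was introduced in", max_new_tokens=10))

    # Masked LM: predict a masked word from both left and right context (BERT-style).
    unmasker = pipeline("fill-mask", model="distilbert-base-uncased")
    print(unmasker("Transformers are pre-trained on large amounts of [MASK] text."))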
From GAN to GPT-4: https://www.marktechpost.com/2023/03/21/a-history-of-generative-ai-from-gan-to-gpt-4/?fbclid=IwAR0qSh774WZA8XEuIZSvbMmoeE9f1WajSZzMHr3PHqCI5ov1597Uv4tFdcg
A few reference points in the history of Transformer models:
June 2018: GPT, the first pre-trained Transformer model, fine-tuned on various NLP tasks to obtain state-of-the-art results (auto-regressive)
October 2018: BERT, another large pre-trained model, this one designed to produce better summaries of sentences. (auto-encoding)
February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns
October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance
October 2019: BART and T5, two large pre-trained models using the same architecture as the original Transformer model (the first to do so) (sequence-to-sequence)
May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)
Encoder models, also called bi-directional or auto-encoding, use only the encoder of the Transformer. Pretraining usually consists of masking words in the input sentence and training the model to reconstruct the original sentence.
At each stage during pretraining, attention layers can access all the input words. This family of models is the most useful for tasks that require understanding complete sentences such as sentence classification or extractive question answering.
Decoder models, often called auto-regressive, use only the decoder of the Transformer. Pretraining usually consists of having the model predict the next word in the sentence.
The attention layers can only access the words positioned before a given word in the sentence. They are best suited for tasks involving text generation.
Encoder-decoder models, also called sequence-to-sequence, use both parts of the Transformer architecture.
Attention layers of the encoder can access all the words in the input, while those of the decoder can only access the words positioned before a given word in the input. The pretraining can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex.
These models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.
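As an illustration of the three families, a minimal sketch using Hugging Face pipelines (the specific checkpoints are example choices only):

    from transformers import pipeline

    # Encoder-only (auto-encoding): sentence-level understanding, e.g. classification.
    classifier = pipeline("sentiment-analysis",
                          model="distilbert-base-uncased-finetuned-sst-2-english")
    print(classifier("I really enjoyed this course."))

    # Decoder-only (auto-regressive): text generation from a prompt.
    generator = pipeline("text-generation", model="gpt2")
    print(generator("Sequence-to-sequence models are best suited for", max_new_tokens=15))

    # Encoder-decoder (sequence-to-sequence): e.g. translation with T5.
    translator = pipeline("translation_en_to_fr", model="t5-small")
    print(translator("Transformers are very useful."))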
Language Modeling (LM): Predict next token (in the case of unidirectional LM) or previous and next token (in the case of bidirectional LM)
Masked Language Modeling (MLM): mask out some tokens in the input sentence and train the model to predict the masked tokens from the remaining tokens (see the masking sketch after this list)
Permuted Language Modeling (PLM): same as LM but on a random permutation of input sequences. A permutation is randomly sampled from all possible permutations. Then some of the tokens are chosen as the target, and the model is trained to predict these targets.
Denoising Autoencoder (DAE): take a partially corrupted input (e.g., randomly sampling tokens from the input and replacing them with [MASK] elements, randomly deleting tokens from the input, or shuffling sentences in random order) and aim to recover the original undistorted input.
Contrastive Learning (CTL): learn a score function for text pairs by assuming that some observed pairs of text are more semantically similar than randomly sampled text. It includes:
Deep InfoMax (DIM): maximize mutual information between an image representation and local regions of the image
Replaced Token Detection (RTD): predict whether a token has been replaced given its surroundings
Next Sentence Prediction (NSP): train the model to distinguish whether two input sentences are continuous segments from the training corpus
Sentence Order Prediction (SOP): similar to NSP, but uses two consecutive segments as positive examples, and the same segments with their order swapped as negative examples
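A minimal sketch of the MLM objective using Hugging Face's DataCollatorForLanguageModeling (bert-base-uncased and the 15% masking rate are example choices):

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    # MLM corrupts ~15% of the tokens and uses the original ids as labels;
    # unmasked positions get label -100 so the loss ignores them.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

    encoding = tokenizer("Masked language modeling reconstructs corrupted input.")
    batch = collator([{"input_ids": encoding["input_ids"]}])
    print(tokenizer.decode(batch["input_ids"][0]))  # some tokens replaced by [MASK]
    print(batch["labels"][0])                       # -100 except at masked positions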
In-context Learning (ICL) - Zero Shot Inference:
In-context learning means helping the LLM learn the task being asked by adding examples or additional data in the prompt; zero-shot inference includes no examples, only the instruction and the input data.
Problem: With smaller models, the output often fails to follow the instruction.
In-context Learning (ICL) - One Shot Inference:
Provide an example within the prompt.
Problem: Sometimes, a single example won't be enough for the model to learn what you want it to do.
In-context Learning (ICL) - Few Shot Inference:
Extend the idea of giving a single example to include multiple examples.
Problem: Examples take up space in the context window, and if the model still performs poorly after five or six examples, fine-tuning is usually the better option.
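A quick sketch of how zero-, one-, and few-shot prompts differ (the sentiment task, reviews, and labels are made-up illustrations):

    # Zero examples -> zero-shot, one example -> one-shot, several -> few-shot.
    def build_prompt(review, examples=()):
        prompt = ""
        for ex_review, ex_label in examples:
            prompt += f"Classify this review: {ex_review}\nSentiment: {ex_label}\n\n"
        prompt += f"Classify this review: {review}\nSentiment: "
        return prompt

    examples = [("I loved this movie!", "Positive"),
                ("What a waste of time.", "Negative")]
    print(build_prompt("The plot was thin but the acting was great."))            # zero-shot
    print(build_prompt("The plot was thin but the acting was great.", examples))  # few-shot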
Generative Config - Greedy vs. Random Sampling
Greedy:
The word/token with the highest probability is selected.
Problem: Works well for short generations but is susceptible to repeated words or repeated sequences of words.
Random (-weighted) sampling:
Select a token using a random-weighted strategy across the probabilities of all tokens.
Introduces some variability into the word generation.
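A toy sketch of the difference, with a made-up next-token distribution:

    import numpy as np

    vocab = ["cake", "donut", "banana", "apple"]
    probs = np.array([0.20, 0.10, 0.02, 0.68])   # made-up probabilities

    print(vocab[int(np.argmax(probs))])          # greedy: always "apple"
    print(np.random.choice(vocab, p=probs))      # random-weighted: usually "apple", but not always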
Top-k sampling:
Restrict the candidates to the k tokens with the highest probability, then select among them using a random-weighted strategy.
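A toy sketch of top-k selection (k=2 and the distribution are arbitrary example values):

    import numpy as np

    vocab = np.array(["cake", "donut", "banana", "apple"])
    probs = np.array([0.20, 0.10, 0.02, 0.68])   # made-up next-token probabilities

    # Keep only the k=2 most probable tokens, renormalize, then sample among them.
    k = 2
    top_k_idx = np.argsort(probs)[-k:]
    top_k_probs = probs[top_k_idx] / probs[top_k_idx].sum()
    print(np.random.choice(vocab[top_k_idx], p=top_k_probs))   # "apple" or "cake"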
Top-p sampling:
Restrict the candidates to the smallest set of top-ranked tokens whose cumulative probability is <= p, then select among them using a random-weighted strategy.
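A toy sketch of top-p selection over the same kind of made-up distribution (p=0.9 is an arbitrary example value):

    import numpy as np

    vocab = np.array(["cake", "donut", "banana", "apple"])
    probs = np.array([0.20, 0.10, 0.02, 0.68])   # made-up next-token probabilities

    # Keep the smallest set of top-ranked tokens whose cumulative probability
    # is <= p (here "apple" + "cake" = 0.88), renormalize, then sample.
    p = 0.9
    order = np.argsort(probs)[::-1]
    cutoff = max(1, int(np.searchsorted(np.cumsum(probs[order]), p)))
    top_p_idx = order[:cutoff]
    top_p_probs = probs[top_p_idx] / probs[top_p_idx].sum()
    print(np.random.choice(vocab[top_p_idx], p=top_p_probs))   # "apple" or "cake"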
Temperature:
A scaling factor that's applied within the final softmax layer of the model.
Impacts the shape of the probability distribution of the next token.
Changing the value of temperature alters the model's predictions.
Low temperature => the result from softmax is more strongly peaked with the probability being concentrated in a smaller number of words.
High temperature => the probability is more evenly spread across the tokens, which leads the model to generate text with a higher degree of randomness and more variability in the output compared to a cool temperature setting.
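A small sketch of how temperature reshapes the softmax output (the logits are made-up values):

    import numpy as np

    def softmax_with_temperature(logits, temperature):
        # Divide the logits by the temperature before the softmax:
        # <1 sharpens the distribution, >1 flattens it, 1.0 leaves it unchanged.
        scaled = np.array(logits) / temperature
        exp = np.exp(scaled - scaled.max())
        return exp / exp.sum()

    logits = [4.0, 2.0, 1.0, 0.5]                  # made-up next-token logits
    print(softmax_with_temperature(logits, 0.5))   # strongly peaked on the first token
    print(softmax_with_temperature(logits, 1.0))   # the model's original distribution
    print(softmax_with_temperature(logits, 2.0))   # flatter => more varied generations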
Challenges in Training LLMs
Out of Memory
Solution:
Quantization
Distributed Data Parallel (DDP)
Fully Sharded Data Parallel (FSDP)
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.
Reduces memory by distributing (sharding) the model parameters, gradients, and optimizer state across GPUs.
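A minimal FSDP sketch in PyTorch (assumes PyTorch >= 1.12, one GPU per process, and launching with torchrun; the tiny Sequential model and dummy loss are placeholders, not a real LLM training loop):

    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group("nccl")                  # one process per GPU via torchrun
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = torch.nn.Sequential(                     # stand-in for a large model
        torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks
    # (ZeRO-style), instead of replicating the full model on every GPU as DDP does.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()                    # dummy loss for illustration
    loss.backward()
    optimizer.step()
    dist.destroy_process_group()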
Scaling Choices for Pre-training: