RAG Framework
Introduction to RAG, RAG process, Encoders (DPR), Tokenizers, FAISS library, Vector Search
What is RAG?
Retrieval Augmented Generation (RAG) is an AI framework that optimizes the output of Large Language Models (LLMs).
It combines an LLM's capabilities with specific domain knowledge or internal databases, without retraining the model.
Problem RAG Solves:
Pre-trained LLMs can struggle with domain-specific knowledge that was not part of their training data.
They may provide inaccurate responses to specialized queries (e.g., confidential company policies).
RAG provides external, relevant knowledge to ensure more accurate, domain-specific responses.
Components:
Retriever: The core component responsible for finding relevant information from the knowledge base.
Generator: An LLM (functioning like a chatbot) that generates the final natural language response.
RAG Process:
The input prompt/question is converted into a high-dimensional vector (embedding) using a question encoder.
Knowledge base documents are broken into smaller chunks.
Each text chunk is converted into a high-dimensional vector (embedding) using a context encoder and indexed (often in a vector database).
Prompt Encoding: Uses token embeddings (e.g., from BERT or GPT) for each word/sub-word, then averages these vectors to represent the entire prompt.
Context Encoding: Similar process of token embedding and averaging applied to each text chunk from the knowledge base.
The system compares the prompt vector with the text chunk vectors in the knowledge base.
It uses distance metrics (like cosine similarity or dot product) to find the vectors (and corresponding text chunks) most similar to the prompt vector.
The top K (e.g., 3-5) most relevant chunks are selected.
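A minimal sketch of this similarity comparison, using made-up mean-pooled vectors (the names prompt_vec and chunk_vecs, the dimension 768, and the random data are purely illustrative):

```python
import numpy as np

# Hypothetical mean-pooled embeddings: one prompt vector and five chunk vectors.
prompt_vec = np.random.rand(768).astype("float32")
chunk_vecs = np.random.rand(5, 768).astype("float32")

# Cosine similarity between the prompt and every chunk.
sims = chunk_vecs @ prompt_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(prompt_vec)
)

# Indices of the top K most similar chunks (higher cosine similarity = more relevant).
k = 3
top_k = np.argsort(-sims)[:k]
print(top_k, sims[top_k])
```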
The text from the retrieved relevant chunks is combined with the original input prompt/question.
This combined "augmented query" is fed into the LLM (the generator component).
The LLM generates the final response based on both the original query and the retrieved contextual information.
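The augmentation step itself is just prompt construction. A minimal sketch, with hypothetical retrieved chunks and question text:

```python
# Hypothetical retrieved chunks and user question (illustrative only).
retrieved_chunks = [
    "Employees may carry over up to 5 unused vacation days.",
    "Carryover requests must be approved by a manager.",
]
question = "How many vacation days can I carry over?"

# Combine the retrieved context with the original question into an augmented query.
augmented_query = "Context:\n" + "\n".join(retrieved_chunks) + "\n\nQuestion: " + question
print(augmented_query)  # this string is what gets passed to the generator LLM
```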
RAG for Dynamic Information Needs:
Retrieval Augmented Generation (RAG) is useful for designing chatbots that handle frequently changing information, like company policies.
It combines a language model's power with real-time information retrieval.
Allows chatbots to query updated databases (knowledge base) dynamically, providing current responses without needing frequent retraining.
This improves accuracy and relevance while reducing maintenance.
Key RAG Components (Encoder/Faiss Focus):
Retriever:
Encodes user prompts and relevant documents (context) into vectors.
Stores context vectors in a vector database.
Retrieves relevant context vectors based on distance calculations.
Generator:
Combines the original prompt with the retrieved context to create the final response.
Context Encoder (Dense Passage Retrieval - DPR):
Encodes potential answer passages or documents (the context) into vector embeddings.
Process:
Import DPR context encoder and tokenizer (e.g., from transformers library).
Load pre-trained tokenizer and encoder models.
Tokenize context documents (input text pairs/passages), applying padding/truncation (e.g., max length 256) and converting to tensors (containing input IDs, token type IDs, attention mask).
Pass tokenized input through the context encoder to generate context vector embeddings (e.g., shape [number_of_passages, embedding_dimension]).
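A minimal sketch of this process, assuming the Hugging Face transformers DPR classes and the standard pre-trained checkpoint; the passages list is a hypothetical example:

```python
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")

passages = ["Policy text chunk one ...", "Policy text chunk two ..."]

# Tokenize with padding/truncation to a fixed max length; returns input IDs,
# token type IDs, and attention mask as PyTorch tensors.
inputs = ctx_tokenizer(passages, padding=True, truncation=True,
                       max_length=256, return_tensors="pt")

# pooler_output has shape [number_of_passages, 768] for this checkpoint.
context_embeddings = ctx_encoder(**inputs).pooler_output
print(context_embeddings.shape)
```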
Faiss (Facebook AI Similarity Search):
A library for efficient similarity searching through large collections of high-dimensional vectors.
Used in RAG to calculate the distance between the question embedding and the database of context embeddings.
Process:
Import Faiss library.
Convert context embeddings (from the encoder) into a NumPy array (e.g., float32).
Initialize a Faiss index object (specifying the distance metric, e.g., L2/Euclidean).
Add the context embeddings to the Faiss index to make them searchable.
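A minimal sketch of building the index, continuing from the context_embeddings tensor above and assuming the faiss-cpu package; L2 (Euclidean) distance is used here:

```python
import faiss
import numpy as np

# Convert the PyTorch embeddings to a float32 NumPy array.
ctx_np = context_embeddings.detach().numpy().astype("float32")

index = faiss.IndexFlatL2(ctx_np.shape[1])  # exact L2 search over the embedding dimension
index.add(ctx_np)                           # make the context embeddings searchable
print(index.ntotal)                         # number of indexed vectors
```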
Question Encoder (DPR):
Encodes input questions into fixed-dimensional vector embeddings.
Captures the meaning and context of the question to facilitate searching.
Process:
Import DPR question encoder and tokenizer.
Load pre-trained tokenizer and encoder models.
Tokenize the input question.
Pass the tokenized question through the question encoder to generate the question embedding.
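A minimal sketch, assuming the transformers DPR question-encoder classes and the standard checkpoint; the question string is a hypothetical example:

```python
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")

question = "How many vacation days can employees carry over?"
q_inputs = q_tokenizer(question, return_tensors="pt")

# Fixed-dimensional question embedding, shape [1, 768] for this checkpoint.
question_embedding = q_encoder(**q_inputs).pooler_output
```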
Retrieval Process using Faiss:
Generate the question embedding using the question encoder.
Use the Faiss index's search method with the question embedding to find the top K (e.g., top 3) closest context embeddings.
The search returns:
Distances (D): Lower values indicate closer matches/higher relevance.
Indices (I): The positions of the closest context embeddings in the original dataset/Faiss index.
Use the returned indices (I) to retrieve the actual text of the corresponding relevant context paragraphs/documents.
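A minimal sketch tying these steps together; index, question_embedding, and passages come from the earlier sketches:

```python
import numpy as np

# Faiss expects a float32 NumPy array of query vectors.
q_np = question_embedding.detach().numpy().astype("float32")

k = 3
D, I = index.search(q_np, k)  # D: distances (lower = closer), I: indices into the indexed passages

# Map the returned indices back to the original text chunks.
top_contexts = [passages[i] for i in I[0]]
print(D[0], top_contexts)
```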
Answer Generation (Using a Decoder like BART):
Without RAG Context:
Import and load a decoder model (e.g., BartForConditionalGeneration) and its tokenizer.
Tokenize only the input question.
Use the model's generate function (with parameters like max length, beam search) to create an answer based solely on the question and the model's pre-trained knowledge.
Decode the generated tokens into readable text.
Limitation: May not answer accurately if the required information (e.g., specific company policies) wasn't in its training data.
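A minimal sketch of generation without retrieved context, assuming a BART checkpoint such as facebook/bart-large; the generation parameters are illustrative:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
bart_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

question = "How many vacation days can employees carry over?"
inputs = bart_tokenizer(question, return_tensors="pt")

# Answer based only on the question and BART's pre-trained knowledge (no retrieved context).
output_ids = bart_model.generate(inputs["input_ids"], max_length=64,
                                 num_beams=4, early_stopping=True)
print(bart_tokenizer.decode(output_ids[0], skip_special_tokens=True))
```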
With RAG Context:
First, perform the retrieval step (using Faiss) to get the relevant context text (top_contexts) based on the question.
Combine the retrieved context text with the original question.
Tokenize this combined input for the decoder model (e.g., BART).
Generate the answer using the model, which now considers both the question and the provided relevant context.
This allows the chatbot to provide accurate, context-specific answers without needing to be retrained on the context data itself.
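A minimal sketch of the RAG-augmented generation step, reusing top_contexts, question, bart_tokenizer, and bart_model from the earlier sketches:

```python
# Combine the retrieved context with the original question.
context_text = " ".join(top_contexts)
combined_input = "question: " + question + " context: " + context_text

inputs = bart_tokenizer(combined_input, return_tensors="pt",
                        truncation=True, max_length=1024)

# The decoder now conditions on both the question and the retrieved context.
output_ids = bart_model.generate(inputs["input_ids"], max_length=64,
                                 num_beams=4, early_stopping=True)
print(bart_tokenizer.decode(output_ids[0], skip_special_tokens=True))
```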