Build a Q&A App with Multi-Modal RAG using Gemini Pro
Bhushan Garware, Aditya Rane, Leonid Kuligin
{Retrieval Augmented Generation (RAG), Large Language Models (LLMs)}
Introduction
Installation
Building Multi-Modal RAG
Step 1: Install and Import dependencies
Step 2: Prepare and load data
Step 3: Generate Text Summaries
Step 4: Generate Image Summaries
Step 5: Build Multi-Vector Retrieval
Step 6: Building Multi-Modal RAG
Step 7: Test your queries
Clean up
Congratulations
Retrieval Augmented Generation (RAG) is a technique that combines the power of large language models (LLMs) with the ability to retrieve relevant information from external knowledge sources. This means an LLM doesn't rely solely on its internal training data; it can also access and incorporate up-to-date, specific information when generating responses.
RAG is gaining popularity for several reasons:
Increased accuracy and relevance: RAG allows LLMs to provide more accurate and relevant responses by grounding them in factual information retrieved from external sources. This is particularly useful in scenarios where up-to-date knowledge is crucial, such as answering questions about current events or providing information on specific topics.
Reduced hallucinations: LLMs can sometimes generate responses that seem plausible but are actually incorrect or nonsensical. RAG helps mitigate this problem by verifying the information generated against external sources.
Greater adaptability: RAG makes LLMs more adaptable to different domains and tasks. By leveraging different knowledge sources, an LLM can be easily customized to provide information on a wide range of topics.
Enhanced user experience: RAG can improve the overall user experience by providing more informative, reliable, and relevant responses.
In today's data-rich world, documents often combine text and images to convey information comprehensively. However, most RAG systems overlook the valuable insights locked within images. As multi-modal LLMs gain prominence, it's crucial to explore how we can leverage visual content alongside text in RAG, unlocking a deeper understanding of the information landscape.
Multimodal Embeddings: The multimodal embeddings model generates 1408-dimension vectors based on the input you provide, which can include a combination of image, text, and video data. The image embedding vector and the text embedding vector share the same semantic space with the same dimensionality. Consequently, these vectors can be used interchangeably for use cases like searching for images by text, or searching for videos by image. Have a look at this demo.
Use multi-modal embeddings to embed both text and images.
Retrieve both using similarity search.
Pass the retrieved raw images and text chunks to a multi-modal LLM for answer synthesis, as in the sketch below.
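Here's a minimal sketch of this first approach, assuming the Vertex AI Python SDK (google-cloud-aiplatform) is installed and your project is authenticated; the project ID, chart.png, and the sample text are placeholders:

```python
# Sketch of the multimodal-embedding approach (assumptions: Vertex AI SDK
# installed and authenticated; "chart.png" and the project ID are placeholders).
import numpy as np
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="your-project-id", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# Image and text land in the same 1408-dimension semantic space.
image_emb = model.get_embeddings(image=Image.load_from_file("chart.png")).image_embedding
text_emb = model.get_embeddings(contextual_text="quarterly revenue chart").text_embedding

# Cosine similarity between the two modalities drives cross-modal search.
a, b = np.asarray(image_emb), np.asarray(text_emb)
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```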
Text Embeddings:
Use a multi-modal LLM to generate text summaries of the images.
Embed and retrieve the text.
Pass the text chunks to an LLM for answer synthesis, as in the sketch below.
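As a rough sketch of this second approach (the one this codelab follows), assuming langchain-google-genai is installed and GOOGLE_API_KEY is set; figure.jpg and the prompt wording are placeholders:

```python
# Sketch of the text-summary approach (assumptions: langchain-google-genai
# installed, GOOGLE_API_KEY set; "figure.jpg" is a placeholder file).
import base64
from langchain_core.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI

vision_llm = ChatGoogleGenerativeAI(model="gemini-pro-vision")

with open("figure.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Ask the multi-modal LLM for a retrieval-friendly summary of the image.
msg = HumanMessage(content=[
    {"type": "text", "text": "Summarize this image for retrieval."},
    {"type": "image_url", "image_url": f"data:image/jpeg;base64,{image_b64}"},
])
summary = vision_llm.invoke([msg]).content
# `summary` is then embedded and indexed like any other text chunk.
print(summary)
```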
The multi-vector retrieval employs summaries of document sections to retrieve the original content for answer synthesis. It improves RAG quality, especially for tasks dense with tables, graphs, and charts. Find more details on LangChain's blog.
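A minimal sketch of multi-vector retrieval, assuming langchain, langchain-community, chromadb, and langchain-google-vertexai are installed; the summaries and originals below are placeholder data, not the codelab's exact code:

```python
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_google_vertexai import VertexAIEmbeddings

# Placeholder data: short summaries are indexed for search,
# originals are kept aside for answer synthesis.
summaries = ["Summary of a revenue chart.", "Summary of a methods section."]
originals = ["<raw image as base64>", "<full original text chunk>"]

vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=VertexAIEmbeddings(model_name="textembedding-gecko"),
)
docstore = InMemoryStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=docstore, id_key=id_key)

# Embed the summaries, but map each back to its original content by id.
doc_ids = [str(uuid.uuid4()) for _ in originals]
retriever.vectorstore.add_documents([
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
])
retriever.docstore.mset(list(zip(doc_ids, originals)))

# A query matches a summary but returns the corresponding original.
print(retriever.get_relevant_documents("How did revenue change?"))
```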
Developing a question-answering system using Gemini Pro
Imagine you have documents containing complex graphs or diagrams packed with information. You want to extract this data to answer questions or queries.
In this codelab, you'll perform the following:
Load data using LangChain document_loaders
Generate text summaries using Google's gemini-pro model
Generate image summaries using Google's gemini-pro-vision model
Create multi-vector retrieval using Google's textembedding-gecko model with Chroma DB as the vector store
Develop a multi-modal RAG chain for question answering, sketched below
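To give a feel for that final step, here is a hedged sketch of how such a chain might be assembled with LangChain Expression Language, reusing the `retriever` built in the multi-vector sketch above; the prompt wording and `format_docs` helper are illustrative, not the codelab's exact code:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_google_genai import ChatGoogleGenerativeAI

def format_docs(docs):
    # Join retrieved originals (text chunks, image summaries) into one context.
    return "\n\n".join(getattr(d, "page_content", str(d)) for d in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)
llm = ChatGoogleGenerativeAI(model="gemini-pro")

# retriever -> prompt -> Gemini -> plain text answer.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(chain.invoke("What trend does the chart show?"))
```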