# Learn How to Build Multi-Modal RAG with ColPali, Llama 3.2 Vision, Qdrant, Reranker, and MAX Serve
This recipe demonstrates how to build a powerful multi-modal RAG (Retrieval Augmented Generation) system over PDF documents that combines ColPali for page-image embeddings, Qdrant for vector storage, a cross-encoder reranker, and Llama 3.2 Vision served with MAX Serve.

While this recipe focuses on PDF documents, the patterns demonstrated here can be adapted to other multi-modal applications.
Please make sure your system meets our system requirements.
- **Meta Llama model access**: the Llama 3.2 Vision model is gated, so request access to it on Hugging Face first.
- **Hugging Face token**: a valid Hugging Face access token from https://huggingface.co/settings/tokens. Set the token in your environment:

```bash
export HUGGING_FACE_HUB_TOKEN=your_token_here
```
This recipe requires a GPU with at least 35 GB of VRAM to run efficiently.
To proceed, ensure you have the `magic` CLI installed, with `magic --version` reporting 0.7.2 or newer:

```bash
curl -ssL https://magic.modular.com/ | bash
```

or update it via:

```bash
magic self-update
```

Then install `max-pipelines` via:

```bash
magic global install -u max-pipelines
```
Download the code for this recipe using the `magic` CLI:

```bash
magic init multimodal-rag-with-colpali-llamavision-reranker --from modular/max-recipes/multimodal-rag-with-colpali-llamavision-reranker
cd multimodal-rag-with-colpali-llamavision-reranker
```
Run the application, making sure ports 6333, 6334, and 8010 are available first (you can adjust the port settings in the Procfile):

```bash
magic run app
```
This command will:

- start MAX Serve with meta-llama/Llama-3.2-11B-Vision on port 8010
- start Qdrant on port 6333
- launch the Gradio UI on port 7860
Open http://localhost:7860 in your browser to see the UI.
In this demo, we upload *The Little Book of Deep Learning*, whose 185 pages are parsed as images for question answering.
To clean up resources when done:

```bash
magic run clean
```
```mermaid
graph TB
    subgraph Frontend
        UI[Gradio UI]
    end
    subgraph Backend
        PDF[PDF Processor]
        ColPali[ColPali Embedder]
        Qdrant[(Qdrant DB)]
        Reranker[Cross-Encoder Reranker]
        LLamaVision[Llama 3.2 Vision]
    end
    subgraph Services
        MAX[MAX Serve]
    end
    UI --> PDF
    PDF --> ColPali
    ColPali --> Qdrant
    UI --> LLamaVision
    Qdrant --> Reranker
    Reranker --> LLamaVision
    LLamaVision --> MAX
```
The architecture consists of several key components: a Gradio UI frontend, a PDF processor, the ColPali embedder, a Qdrant vector database, a cross-encoder reranker, and Llama 3.2 Vision served by MAX Serve.
Here's how a typical query flows through the system:
```mermaid
sequenceDiagram
    participant U as User
    participant G as Gradio UI
    participant P as PDF Processor
    participant C as ColPali
    participant Q as Qdrant
    participant R as Reranker
    participant L as Llama Vision

    U->>G: Upload PDF
    G->>P: Process PDF
    P->>C: Generate embeddings
    C->>Q: Store vectors
    U->>G: Ask question
    G->>C: Embed query
    C->>Q: Search similar
    Q->>R: Rerank results
    R->>L: Get top images
    L->>G: Generate response
    G->>U: Show answer + context
```
- **PDF upload and processing**: each page of the uploaded PDF is rendered to an image, embedded with ColPali, and stored as vectors in Qdrant.
- **Query processing**: the question is embedded with ColPali, similar pages are retrieved from Qdrant and reranked, and the top page images are passed to Llama 3.2 Vision to generate a grounded answer.
The system processes PDFs in multiple stages:
```python
import os
from io import BytesIO

import fitz  # PyMuPDF
from PIL import Image

class PDFProcessor:
    def __init__(self, temp_dir="./temp_images"):
        self.temp_dir = temp_dir
        os.makedirs(temp_dir, exist_ok=True)

    def extract_images(self, pdf_path):
        """Extract images from PDF file"""
        images = []
        doc = fitz.open(pdf_path)
        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            # Render at 300 DPI (PDF's base resolution is 72 DPI)
            pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
            img = Image.open(BytesIO(pix.tobytes()))
            images.append(img)
        return images
```
Key features: pages are rendered at 300 DPI so small text stays legible, and the resulting pages are kept in memory as PIL images, ready for embedding.
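Each extracted page also needs to be serialized before it can travel in a Qdrant payload; the RAG class later reads it back with `result.payload.get("image")`. Here is a minimal sketch of that step, assuming base64-encoded JPEG (the `encode_image` helper name is ours, not the recipe's):

```python
import base64
from io import BytesIO

def encode_image(img, quality=85):
    """Serialize a PIL image to a base64 JPEG string for a Qdrant payload."""
    buffered = BytesIO()
    img.convert("RGB").save(buffered, format="JPEG", quality=quality)  # JPEG has no alpha
    return base64.b64encode(buffered.getvalue()).decode("utf-8")
```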
The ColPali model generates embeddings for both images and queries:
```python
# EMBEDDING_MODEL and BATCH_SIZE are module-level constants (see Customization)
class EmbedData:
    def __init__(self, embed_model_name=EMBEDDING_MODEL, batch_size=BATCH_SIZE):
        self.embed_model_name = embed_model_name
        self.batch_size = batch_size
        self.embeddings = []
        self.embed_model, self.processor = self._load_embed_model()
```
Optimizations include batched embedding generation (controlled by BATCH_SIZE), so large documents can be processed without exhausting GPU memory.
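For reference, here is a minimal sketch of how ColPali embeddings are typically produced with the `colpali_engine` package; the actual `_load_embed_model` and `get_query_embedding` implementations may differ in detail, and the PDF path below is a placeholder:

```python
import torch
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.3"
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Each page image becomes a multi-vector: one 128-dim vector per image patch.
images = PDFProcessor().extract_images("example.pdf")  # placeholder path
batch = processor.process_images(images).to(model.device)
with torch.no_grad():
    image_embeddings = model(**batch)

# Queries are embedded the same way.
queries = processor.process_queries(["What is attention?"]).to(model.device)
with torch.no_grad():
    query_embedding = model(**queries)
```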
Qdrant handles vector storage and retrieval:
```python
class QdrantVectorDB:
    def __init__(self, collection_name, vector_dim=128, batch_size=4):
        self.collection_name = collection_name
        self.batch_size = batch_size
        self.vector_dim = vector_dim  # ColPali patch vectors are 128-dimensional
        self.client = QdrantClient(url=QDRANT_URL, prefer_grpc=True)
```
Features: a gRPC-preferred client connection for faster batched uploads, and 128-dimensional vectors matching ColPali's per-patch embedding size.
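Because ColPali emits one vector per image patch, the collection is best created with Qdrant's multivector support so that late-interaction (MaxSim) scoring happens server-side. A sketch of what the collection setup could look like, assuming a collection named `pdf_pages`:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333", prefer_grpc=True)

client.create_collection(
    collection_name="pdf_pages",  # assumed name, for illustration
    vectors_config=models.VectorParams(
        size=128,                         # ColPali patch-vector size
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM  # late interaction
        ),
    ),
)
```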
The system uses a two-stage retrieval process with ColPali embeddings and cross-encoder reranking:
```python
import logging

from qdrant_client import models
from rerankers import Reranker

logger = logging.getLogger(__name__)

class Retriever:
    def __init__(self, vector_db, embed_data, use_reranker=True):
        self.vector_db = vector_db
        self.embed_data = embed_data
        self.use_reranker = use_reranker
        if self.use_reranker:
            logger.info("Initializing reranker...")
            self.reranker = Reranker('cross-encoder')
            logger.info("Reranker initialized successfully")

    def search(self, query, display_limit=3, initial_fetch=10):
        """Two-stage retrieval with reranking"""
        # First stage: semantic search with ColPali embeddings.
        # query_points returns a QueryResponse; .points is the list of hits.
        query_embedding = self.embed_data.get_query_embedding(query)
        results = self.vector_db.client.query_points(
            collection_name=self.vector_db.collection_name,
            query=query_embedding,
            limit=initial_fetch  # Fetch more candidates for reranking
        ).points
        if self.use_reranker and results:
            try:
                # Second stage: cross-encoder reranking.
                # Candidates are represented by their page labels.
                docs = [f"Page {result.id}" for result in results]
                doc_ids = [result.id for result in results]
                reranked = self.reranker.rank(
                    query=query,
                    docs=docs,
                    doc_ids=doc_ids
                )
                # Keep the top K after reranking
                top_results = reranked.top_k(display_limit)
                # Map reranked ids back to the original Qdrant results
                final_results = []
                for reranked_result in top_results:
                    for orig_result in results:
                        if orig_result.id == reranked_result.doc_id:
                            final_results.append(orig_result)
                            break
                return models.QueryResponse(points=final_results)
            except Exception as e:
                logger.warning(f"Reranking failed: {e}. Using original order.")
                return models.QueryResponse(points=results[:display_limit])
        return models.QueryResponse(points=results[:display_limit])
```
Key features: a fast first-stage vector search over ColPali embeddings fetches `initial_fetch` candidates, a cross-encoder then rescores each (query, candidate) pair, and if reranking fails the retriever gracefully falls back to the original vector-search order. The reranking process keeps only the top `display_limit` results and maps them back to the original Qdrant points, so their payloads (including the page images) stay available downstream.
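A minimal usage sketch, assuming a collection that has already been populated (the variable wiring here is illustrative):

```python
vector_db = QdrantVectorDB(collection_name="pdf_pages")
embed_data = EmbedData()
retriever = Retriever(vector_db, embed_data, use_reranker=True)

response = retriever.search("How does attention work?", display_limit=3)
for point in response.points:
    print(point.id, point.score)
```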
The UI provides an intuitive interface:
```python
class UI:
    def __init__(self):
        self.pdf_processor = PDFProcessor()
        self.embed_data = EmbedData(batch_size=BATCH_SIZE)
        self.vector_db = QdrantVectorDB(...)
        self.retriever = Retriever(...)
        self.rag = RAG(...)
```
Features include PDF upload, a question box, and a gallery that displays the retrieved page images alongside the generated answer.
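A rough sketch of how those pieces might be wired together in Gradio; the component names and callbacks below are illustrative, not the recipe's exact code:

```python
import gradio as gr

ui = UI()

def on_upload(pdf_path):
    # Extract page images; embedding and upserting into Qdrant would follow.
    images = ui.pdf_processor.extract_images(pdf_path)
    return f"Indexed {len(images)} pages"

def on_ask(question):
    answer, page_info = ui.rag.query(question)
    return answer, ", ".join(page_info)

with gr.Blocks() as demo:
    pdf_input = gr.File(label="Upload PDF", type="filepath")
    status = gr.Textbox(label="Status")
    question = gr.Textbox(label="Ask a question")
    answer = gr.Textbox(label="Answer")
    pages = gr.Textbox(label="Source pages")
    pdf_input.upload(on_upload, inputs=pdf_input, outputs=status)
    question.submit(on_ask, inputs=question, outputs=[answer, pages])

demo.launch(server_port=7860)
```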
The RAG system combines retrieval results with Llama 3.2 Vision for answer generation:
```python
class RAG:
    def __init__(self, retriever):
        self.retriever = retriever
        # `client` is a module-level OpenAI-compatible client pointed at MAX Serve
        self.llm_client = client

    def generate_context(self, query, display_limit=5, llm_limit=2):
        """Retrieve and prepare context for LLM"""
        results = self.retriever.search(query, display_limit=display_limit)
        context_images = []
        page_info = []
        for result in results.points:
            image_b64 = result.payload.get("image")
            if image_b64:
                context_images.append(image_b64)
                page_info.append(f"Page {result.id}")
        return context_images, page_info, llm_limit

    def query(self, query):
        """Generate response using RAG with Llama 3.2 Vision"""
        context_images, page_info, llm_limit = self.generate_context(query)

        # Prepare messages with images for the LLM
        # (SYSTEM_PROMPT is a module-level constant)
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": query}
            ]}
        ]

        # Add context images to the user message as data URLs
        for img_b64 in context_images[:llm_limit]:
            messages[1]["content"].append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{img_b64}"
                }
            })

        response = self.llm_client.chat.completions.create(
            model="meta-llama/Llama-3.2-11B-Vision-Instruct",
            messages=messages,
            max_tokens=1024,
            temperature=0.3
        )
        return response.choices[0].message.content, page_info
```
Key features: retrieved page images are attached to the chat request as base64 data URLs, the number of images sent to the model is capped by `llm_limit` to stay within the context window, and a low temperature (0.3) keeps answers grounded in the retrieved pages.
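For completeness, here is one plausible way the module-level `client` could be constructed, given that MAX Serve exposes an OpenAI-compatible endpoint on port 8010 (the exact base URL is our assumption):

```python
from openai import OpenAI

# Assumed endpoint; MAX Serve speaks the OpenAI chat-completions protocol.
client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")

rag = RAG(retriever)  # `retriever` from the earlier sketch
answer, pages = rag.query("What does the book say about convolutions?")
print(answer)
print("Sources:", pages)
```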
You can customize the system in several ways:
```python
# Change embedding model
EMBEDDING_MODEL = "vidore/colpali-v1.3"  # Try different ColPali versions

# Adjust batch processing
BATCH_SIZE = 8  # Increase/decrease based on GPU memory

# Modify LLM parameters
temperature = 0.3  # Higher for more creative responses
max_tokens = 1024  # Adjust response length
```
```python
class Retriever:
    def search(self, query, display_limit=3, initial_fetch=10):
        # Adjust the number of candidates fetched for reranking
        initial_fetch = 10  # More candidates = better results but slower

    # Configure the reranker (in __init__)
    self.reranker = Reranker(
        'cross-encoder',
        model_name='cross-encoder/ms-marco-MiniLM-L-6-v2'
    )
```
```python
class PDFProcessor:
    def extract_images(self, pdf_path):
        # Adjust image quality
        matrix = fitz.Matrix(300 / 72, 300 / 72)  # Change DPI

        # Modify image format
        img.save(buffered, format="JPEG", quality=85)  # Adjust quality
```
```python
# Customize the Gradio interface
custom_css = """
.gradio-container {
    font-family: 'Segoe UI', Arial, sans-serif;
}
.gr-button {
    background-color: #1f6feb !important;
}
"""

# Adjust gallery display
context_gallery = gr.Gallery(
    label="Context Images",
    columns=2,  # Change layout
    height="auto"
)
```
Common issues and solutions:

- **GPU memory issues**: reduce `BATCH_SIZE` in the configuration.
- **PDF processing issues**: confirm the PDF opens and its pages render; lowering the render DPI in `PDFProcessor` also cuts memory use.
- **Qdrant connection issues**: make sure ports 6333 and 6334 are free and the Qdrant service is running.
- **MAX Serve issues**: verify port 8010 is available and that `HUGGING_FACE_HUB_TOKEN` is set so the gated Llama weights can be downloaded.
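When debugging service issues, a quick health check of both backends can help. This sketch assumes Qdrant's standard REST port and the usual OpenAI-compatible `/v1/models` route on MAX Serve:

```python
import requests

# Qdrant: the root endpoint returns name/version info when the service is up.
print(requests.get("http://localhost:6333").json())

# MAX Serve: OpenAI-compatible servers typically list loaded models here.
print(requests.get("http://localhost:8010/v1/models").json())
```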
Now that you've built a multi-modal RAG system, you can:

- Enhance the system
- Deploy it to production
- Join the community and share what you build with #ModularAI on social media

We're excited to see what you'll build with this foundation!
DETAILS

AUTHOR: Ehsan M. Kermani

AVAILABLE TASKS:

```bash
magic run app
magic run clean
```

PROBLEMS WITH THE CODE? File an issue.