RAG + ChainLit
RAG in action
Follow the WSL + Docker instructions to configure your environment.
Ensure Ollama is installed & configured.
Clone the repo and navigate to the folder.
git clone https://github.com/jporeilly/Workshop--LLM.git
cd Workshop--LLM/Playground/chainlit
ls
Ensure uv is installed.
Check uv.
uv --version
uv 0.6.3
Pull the Qdrant Docker image and deploy the container.
docker pull qdrant/qdrant
docker run --name qdrant -p 6333:6333 -p 6334:6334 \
-v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
This starts the Qdrant Docker container. Rename .env.example to .env and set the Qdrant URL in it (for a local container this is http://localhost:6333):
QDRANT_URL_LOCALHOST="xxxxx"
Install the required packages (this creates the virtual environment).
uv sync
Qdrant’s Web UI is an intuitive and efficient graphic interface for your Qdrant Collections, REST API and data points.
In the Console, you may use the REST API to interact with Qdrant, while in Collections, you can manage all the collections and upload Snapshots.
In the Qdrant Web UI, you can:
Run HTTP-based calls from the console
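For example, the console requests map directly onto Qdrant's REST API, so the same call can be made from Python (assuming the default local URL):
import requests

# List the collections in the local Qdrant instance
# (roughly equivalent to a `GET collections` call in the Web UI console).
response = requests.get("http://localhost:6333/collections", timeout=5)
response.raise_for_status()
for collection in response.json()["result"]["collections"]:
    print(collection["name"])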
Run the Chainlit app.
uv run setup-rag.py
uv run chainlit run rag-chainlit-deepseek.py -p 8501
Enter a question, for example:
What is the score of OpenAI o1-mini and DeepSeek-R1-Zero on reasoning-related benchmarks?
This script automates the setup of a Retrieval-Augmented Generation (RAG) system, which combines the power of large language models with document retrieval. The script performs four key tasks:
Document Loading: using DoclingLoader to read PDF documents
Document Chunking: breaking documents into smaller, manageable pieces
Embedding Generation: converting text chunks into vector embeddings
Metadata Handling: ensuring all chunks have proper metadata for retrieval
Qdrant is a specialized database that stores text chunks and their vector representations
Enables semantic search (finding similar concepts, not just keyword matching)
Used during inference to retrieve relevant document chunks
The DeepSeek LLM generates responses based on:
The user's question
The retrieved document chunks from the vector database
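As a rough illustration of how these pieces meet (the actual prompt template used by the app appears later in rag-chainlit-deepseek.py), the retrieved chunks and the question are combined into a single prompt along these lines:
def build_prompt(question: str, chunks: list[str]) -> str:
    # Join the retrieved chunks into one context block, then append the question.
    context = "\n\n".join(chunks)
    return (
        "Answer the question based only on the following context:\n"
        f"{context}\n\nQuestion: {question}\n"
    )

print(build_prompt("What is RAG?", ["chunk one ...", "chunk two ..."]))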
Environment and Dependency Setup
Loads configuration from the .env file
Sets up logging
Defines model and file paths
Ollama LLM Setup
Checks if Ollama server is running
Lists available models
Downloads the DeepSeek LLM if not already available
Application Update
Finds the application file
Updates model references to use the correct model name
Qdrant Database Setup
Checks if Qdrant server is running
Attempts to start Qdrant if needed (using Docker)
Provides detailed troubleshooting if connection fails
Document Processing
Loads and chunks the document
Ensures all chunks have proper metadata
Adds a 'page' field to prevent KeyError('page') during retrieval
Vector Database Creation
Converts all chunks to embeddings
Stores the vectors in Qdrant
Associates each vector with its document text and metadata
The script includes robust error handling:
Checks for prerequisites before proceeding
Attempts automated fixes for common issues
Provides detailed error messages with troubleshooting steps
Continues with partial completion if some steps succeed
import os
import sys
import logging
import subprocess
import requests
import time
from typing import Iterator, List, Optional
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_community.document_loaders import DirectoryLoader
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document as LCDocument
from langchain_docling import DoclingLoader
from docling.chunking import HybridChunker
from langchain_docling.loader import ExportType
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
# Load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()
# Get Qdrant URL from environment variables
qdrant_url = os.getenv("QDRANT_URL_LOCALHOST", "http://localhost:6333")
# Define model for embeddings - using a smaller, efficient sentence transformer model
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
# Set the export type - DOC_CHUNKS means documents will be exported as chunked pieces
EXPORT_TYPE = ExportType.DOC_CHUNKS
# Path to the PDF file that will be processed
FILE_PATH = "./data/DeepSeek_R1.pdf"
# Target Ollama model
TARGET_MODEL = "deepseek-llm:latest"
def check_ollama_running() -> bool:
"""Check if Ollama server is running"""
try:
response = requests.get("http://localhost:11434/api/tags", timeout=5)
return response.status_code == 200
except requests.exceptions.ConnectionError:
return False
except Exception as e:
logger.error(f"Error checking Ollama server: {str(e)}")
return False
def check_qdrant_running() -> bool:
"""Check if Qdrant server is running"""
try:
response = requests.get(f"{qdrant_url}/collections", timeout=5)
return response.status_code == 200
except requests.exceptions.ConnectionError:
return False
except Exception as e:
logger.error(f"Error checking Qdrant server: {str(e)}")
return False
def start_qdrant_container():
"""Attempt to start a Qdrant container if it's not running"""
try:
# Check if Docker is available
docker_check = subprocess.run(["docker", "--version"],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True)
if docker_check.returncode != 0:
logger.error("Docker is not available. Please install Docker or start Qdrant manually.")
return False
# Check if qdrant container exists
container_check = subprocess.run(["docker", "ps", "-a", "--filter", "name=qdrant", "--format", "{{.Names}}"],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True)
container_exists = "qdrant" in container_check.stdout
if container_exists:
# Start existing container
logger.info("Found existing Qdrant container. Attempting to start it...")
subprocess.run(["docker", "start", "qdrant"])
else:
# Create and start new container
logger.info("Creating new Qdrant container...")
subprocess.run([
"docker", "run", "-d",
"--name", "qdrant",
"-p", "6333:6333",
"-p", "6334:6334",
"-v", "qdrant_storage:/qdrant/storage",
"qdrant/qdrant"
])
# Wait for container to start
for _ in range(5):
if check_qdrant_running():
logger.info("Qdrant server is now running")
return True
logger.info("Waiting for Qdrant server to start...")
time.sleep(2)
logger.error("Qdrant server failed to start within the expected time")
return False
except Exception as e:
logger.error(f"Error starting Qdrant container: {str(e)}")
return False
def list_available_models() -> List[str]:
"""List all available models in Ollama"""
try:
response = requests.get("http://localhost:11434/api/tags")
if response.status_code == 200:
models = response.json().get("models", [])
return [model.get("name") for model in models]
return []
except requests.exceptions.ConnectionError:
return []
except Exception as e:
logger.error(f"Error listing models: {str(e)}")
return []
def pull_model(model_name: str) -> bool:
"""Pull a model from Ollama"""
logger.info(f"Pulling model: {model_name}")
try:
# Using subprocess to show progress in real-time
process = subprocess.Popen(
["ollama", "pull", model_name],
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
universal_newlines=True
)
# Print output in real-time
for line in process.stdout:
print(line, end='')
sys.stdout.flush()
process.wait()
return process.returncode == 0
except Exception as e:
logger.error(f"Error pulling model: {str(e)}")
return False
def setup_ollama_model() -> bool:
"""Ensure the required Ollama model is available"""
# Check if Ollama is running
if not check_ollama_running():
logger.error("Ollama server is not running. Please start it first.")
return False
# List available models
models = list_available_models()
logger.info(f"Available models: {', '.join(models) if models else 'None'}")
# Check for DeepSeek model
if any(model.startswith("deepseek") for model in models):
logger.info(f"DeepSeek model already available")
return True
else:
logger.info(f"DeepSeek model not found. Will pull {TARGET_MODEL}")
success = pull_model(TARGET_MODEL)
if success:
logger.info(f"Successfully pulled {TARGET_MODEL}")
return True
else:
logger.error(f"Failed to pull {TARGET_MODEL}")
return False
def update_application_model() -> Optional[str]:
"""
Check for the application file and update the model reference if needed.
Returns the file path if updated successfully, None otherwise.
"""
app_file = "./rag-chainlit-deepseek.py"
if not os.path.exists(app_file):
logger.warning(f"Application file {app_file} not found. You will need to manually update your model reference.")
return None
try:
with open(app_file, 'r', encoding='utf-8') as f:
content = f.read()
# Look for the Ollama initialization pattern
if "deepseek-r1:latest" in content:
# Replace the incorrect model name with the correct one
updated_content = content.replace("deepseek-r1:latest", TARGET_MODEL)
# Write the updated content back
with open(app_file, 'w', encoding='utf-8') as f:
f.write(updated_content)
logger.info(f"Updated model reference in {app_file} from 'deepseek-r1:latest' to '{TARGET_MODEL}'")
return app_file
else:
logger.info(f"No incorrect model reference found in {app_file} or the file uses a different format.")
return None
except Exception as e:
logger.error(f"Error updating application file: {str(e)}")
return None
def create_vector_database():
"""
Creates a vector database from a PDF document using Docling for document processing
and Qdrant for vector storage.
The function:
1. Loads and chunks the document using Docling
2. Processes chunks based on export type
3. Saves the content to a markdown file
4. Creates embeddings using HuggingFace
5. Stores the embeddings in a Qdrant vector database
"""
# Check if Qdrant is running
if not check_qdrant_running():
logger.error("Qdrant server is not running. Attempting to start it...")
if not start_qdrant_container():
logger.error("""
Failed to connect to Qdrant server.
Please ensure Qdrant is installed and running:
To install and run Qdrant with Docker:
docker run -d -p 6333:6333 -p 6334:6334 -v qdrant_storage:/qdrant/storage qdrant/qdrant
Or run Qdrant locally following instructions at:
https://qdrant.tech/documentation/quick-start/
""")
return False
# Initialize DoclingLoader with specified parameters
loader = DoclingLoader(
file_path=FILE_PATH,
export_type=EXPORT_TYPE,
chunker=HybridChunker(
tokenizer=EMBED_MODEL_ID,
chunk_size=300, # Reduced chunk size
chunk_overlap=30, # Some overlap to maintain context between chunks
split_factor=0.5, # More aggressive splitting
),
)
logger.info(f"Loading document from {FILE_PATH}")
# Load and process the document
docling_documents = loader.load()
# Process the documents based on the export type
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
# Create a text splitter with a smaller chunk size
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=400, # Characters, not tokens, but a safe size
chunk_overlap=50,
length_function=lambda text: len(text.split()), # Approximate token count using word count
separators=["\n\n", "\n", " ", ""]
)
# Further split any chunks that might be too large
logger.info(f"Processing {len(docling_documents)} initial chunks from Docling")
splits = []
for doc in docling_documents:
# Ensure metadata has a 'page' field to avoid KeyError('page')
if 'page' not in doc.metadata:
# Extract page number from source if available, or default to 1
page_num = doc.metadata.get('source', '').split('_')[-1].split('.')[0] if 'source' in doc.metadata else '1'
try:
doc.metadata['page'] = int(page_num)
except ValueError:
doc.metadata['page'] = 1
# Check if this chunk is potentially too large
if len(doc.page_content.split()) > 400: # If chunk has > 400 words, further split it
logger.info(f"Splitting large chunk of size ~{len(doc.page_content.split())} words")
smaller_chunks = text_splitter.split_text(doc.page_content)
# Convert the text chunks back to LangChain Documents with metadata preserved
splits.extend([
LCDocument(page_content=chunk, metadata=doc.metadata)
for chunk in smaller_chunks
])
else:
splits.append(doc)
logger.info(f"After additional splitting: {len(splits)} chunks")
elif EXPORT_TYPE == ExportType.MARKDOWN:
# Split based on markdown headers
splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[
("#", "Header_1"),
("##", "Header_2"),
("###", "Header_3"),
],
)
initial_splits = [split for doc in docling_documents for split in splitter.split_text(doc.page_content)]
# Further chunking
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=400,
chunk_overlap=50,
length_function=lambda text: len(text.split()),
separators=["\n\n", "\n", " ", ""]
)
splits = []
for doc in initial_splits:
# Ensure metadata has a 'page' field
if 'page' not in doc.metadata:
doc.metadata['page'] = 1 # Default page value
if len(doc.page_content.split()) > 400:
smaller_chunks = text_splitter.split_text(doc.page_content)
splits.extend([
LCDocument(page_content=chunk, metadata=doc.metadata)
for chunk in smaller_chunks
])
else:
splits.append(doc)
logger.info(f"After markdown splitting and additional chunking: {len(splits)} chunks")
else:
# Raise an error for unsupported export types
raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
# Save the processed document to a markdown file
with open('data/output_docling.md', 'a', encoding='utf-8') as f: # utf-8 encoding for unicode support
for doc in docling_documents:
f.write(doc.page_content + '\n')
# Initialize the embedding model from HuggingFace
embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)
# Log metadata of documents for debugging
for i, doc in enumerate(splits[:3]): # Log first 3 documents as samples
logger.info(f"Document {i} metadata: {doc.metadata}")
# Check for extremely long chunks before embedding
max_token_length = max([len(doc.page_content.split()) for doc in splits])
logger.info(f"Longest chunk is approximately {max_token_length} words")
if max_token_length > 500:
logger.warning(f"Some chunks may still be too long for the embedding model (max: {max_token_length} words)")
# Create a Qdrant vector store from the document chunks
try:
# Final check to ensure Qdrant is still running
if not check_qdrant_running():
logger.error("Qdrant server connection lost. Please ensure the server is running properly.")
return False
# Create the vector store
logger.info(f"Creating vector store at {qdrant_url}")
vectorstore = QdrantVectorStore.from_documents(
documents=splits,
embedding=embedding,
url=qdrant_url,
collection_name="rag",
force_recreate=True, # Force recreate the collection if it exists
)
logger.info(f"Successfully created vector store with {len(splits)} chunks")
return True
except Exception as e:
logger.error(f"Error creating vector store: {str(e)}")
logger.error(f"""
Failed to connect to Qdrant at {qdrant_url}.
Troubleshooting steps:
1. Check if Qdrant is running:
- For Docker: run 'docker ps' to see if the container is running
- For local installation: check if the process is running
2. Verify the URL in your .env file:
- It should contain QDRANT_URL_LOCALHOST=http://localhost:6333
3. Make sure ports 6333 and 6334 are not blocked by a firewall
4. Try running Qdrant manually:
- Docker: docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
- Local: follow instructions at https://qdrant.tech/documentation/quick-start/
""")
return False
def save_documents_to_pickle(documents, output_file="data/processed_documents.pkl"):
"""Save processed documents to a pickle file as a fallback"""
import pickle
try:
with open(output_file, 'wb') as f:
pickle.dump(documents, f)
logger.info(f"Saved processed documents to {output_file} as a fallback")
return True
except Exception as e:
logger.error(f"Error saving documents to pickle: {str(e)}")
return False
def main():
"""Main function to run the complete RAG setup"""
logger.info("Starting RAG setup process...")
# Step 1: Set up Ollama model
logger.info("STEP 1: Setting up Ollama model...")
if not setup_ollama_model():
logger.error("Failed to set up Ollama model. Exiting.")
return
# Step 2: Update application model reference if needed
logger.info("STEP 2: Checking application model reference...")
updated_file = update_application_model()
if updated_file:
logger.info(f"Successfully updated model reference in {updated_file}")
else:
logger.warning("No automatic model update performed. Please check your application file manually.")
# Step 3: Create vector database
logger.info("STEP 3: Creating vector database...")
db_success = create_vector_database()
if db_success:
logger.info("Vector database created successfully!")
# Final step: Display success message
logger.info("""
=====================================================
RAG SETUP COMPLETED SUCCESSFULLY!
What's been done:
1. Checked/pulled the DeepSeek LLM model in Ollama
2. Updated application file model reference (if found)
3. Created vector database with proper metadata
You can now run your Chainlit application:
$ chainlit run rag-chainlit-deepseek.py
=====================================================
""")
else:
logger.error("""
=====================================================
PARTIAL SETUP COMPLETED
What's been done:
1. Checked/pulled the DeepSeek LLM model in Ollama ✓
2. Updated application file model reference ✓
3. Failed to create vector database ✗
Please troubleshoot your Qdrant database connection
before running your Chainlit application.
=====================================================
""")
if __name__ == "__main__":
main()
Documents are split into smaller pieces to:
Fit within embedding model context windows
Allow for more granular retrieval
Enable more focused responses
The script employs two levels of chunking:
Initial chunking with HybridChunker from Docling
Secondary chunking with RecursiveCharacterTextSplitter for any chunks that are still too large
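As a stand-alone illustration of the secondary pass, the same splitter settings used in the script can be exercised on any over-long string:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Secondary splitter: word count is used as a rough proxy for token count.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    length_function=lambda text: len(text.split()),
    separators=["\n\n", "\n", " ", ""],
)

long_text = "word " * 1000  # well over the 400-word limit
pieces = splitter.split_text(long_text)
print(len(pieces), [len(p.split()) for p in pieces])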
Text embeddings are numerical representations of text that capture semantic meaning. Our script uses Sentence Transformers (all-MiniLM-L6-v2), which:
Creates 384-dimensional vectors
Positions semantically similar text closer together in vector space
Enables "meaning-based" search rather than just keyword matching
When a user asks a question:
The question is converted to an embedding
The vector database finds chunks with similar embeddings
The most relevant chunks are sent to the LLM along with the question
The LLM generates a response based on the question and retrieved information
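The retrieval half of that loop can be tried on its own once setup-rag.py has populated the "rag" collection (the question is illustrative):
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore

# Connect to the existing collection created by the setup script.
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = QdrantVectorStore.from_existing_collection(
    embedding=embedding,
    collection_name="rag",
    url="http://localhost:6333",
)

# The question is embedded and the closest chunks are returned.
docs = vectorstore.similarity_search("What benchmarks does DeepSeek-R1 report?", k=4)
for doc in docs:
    print(doc.metadata.get("page"), doc.page_content[:80])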
Standard Libraries:
os: For handling environment variables and path operations
typing: Provides the Iterator type hint
Document Processing:
langchain_docling: Integrates Docling with LangChain
docling.chunking: Provides the HybridChunker for intelligent document chunking
Embedding and Vector Storage:
langchain_huggingface.embeddings: For generating document embeddings
langchain_qdrant: Connects to the Qdrant vector database
Other LangChain Components:
Various text splitters and document loaders
Environment Setup:
dotenv: For loading environment variables from a .env file
EMBED_MODEL_ID: Uses "sentence-transformers/all-MiniLM-L6-v2", a lightweight sentence transformer model for creating embeddings
EXPORT_TYPE: Set to ExportType.DOC_CHUNKS, meaning the document will be exported as chunked pieces
FILE_PATH: Points to the PDF file ("./data/DeepSeek_R1.pdf") that will be processed
create_vector_database()
This function handles the entire process of creating a vector database from a PDF document:
# Check if Qdrant is running
if not check_qdrant_running():
logger.error("Qdrant server is not running. Attempting to start it...")
if not start_qdrant_container():
logger.error("""
Failed to connect to Qdrant server.
Please ensure Qdrant is installed and running:
To install and run Qdrant with Docker:
docker run -d -p 6333:6333 -p 6334:6334 -v qdrant_storage:/qdrant/storage qdrant/qdrant
Or run Qdrant locally following instructions at:
https://qdrant.tech/documentation/quick-start/
""")
return False
This begins with a critical connection check to ensure Qdrant is available, with automatic recovery attempts and detailed error instructions if it fails.
loader = DoclingLoader(
file_path=FILE_PATH,
export_type=EXPORT_TYPE,
chunker=HybridChunker(
tokenizer=EMBED_MODEL_ID,
chunk_size=300, # Reduced chunk size
chunk_overlap=30, # Some overlap to maintain context between chunks
split_factor=0.5, # More aggressive splitting
),
)
logger.info(f"Loading document from {FILE_PATH}")
docling_documents = loader.load()
This initializes a DoclingLoader with optimized parameters to ensure chunks are small enough for the embedding model's context window. This addresses the "Token indices sequence length is longer than the specified maximum sequence length" warning.
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
# Create a text splitter with a smaller chunk size
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=400, # Characters, not tokens, but a safe size
chunk_overlap=50,
length_function=lambda text: len(text.split()), # Approximate token count using word count
separators=["\n\n", "\n", " ", ""]
)
# Further split any chunks that might be too large
logger.info(f"Processing {len(docling_documents)} initial chunks from Docling")
splits = []
for doc in docling_documents:
# Ensure metadata has a 'page' field to avoid KeyError('page')
if 'page' not in doc.metadata:
# Extract page number from source if available, or default to 1
page_num = doc.metadata.get('source', '').split('_')[-1].split('.')[0] if 'source' in doc.metadata else '1'
try:
doc.metadata['page'] = int(page_num)
except ValueError:
doc.metadata['page'] = 1
# Check if this chunk is potentially too large
if len(doc.page_content.split()) > 400: # If chunk has > 400 words, further split it
logger.info(f"Splitting large chunk of size ~{len(doc.page_content.split())} words")
smaller_chunks = text_splitter.split_text(doc.page_content)
# Convert the text chunks back to LangChain Documents with metadata preserved
splits.extend([
LCDocument(page_content=chunk, metadata=doc.metadata)
for chunk in smaller_chunks
])
else:
splits.append(doc)
logger.info(f"After additional splitting: {len(splits)} chunks")
This applies secondary chunking to split any remaining large chunks, addressing the token length limitations. It also ensures each chunk has the critical page metadata field to avoid the KeyError('page') error during retrieval.
with open('data/output_docling.md', 'a', encoding='utf-8') as f: # utf-8 encoding for unicode support
for doc in docling_documents:
f.write(doc.page_content + '\n')
This saves the processed document chunks to a markdown file with explicit UTF-8 encoding to handle special characters, preventing UnicodeEncodeError issues.
# Initialize the embedding model from HuggingFace
embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)
# Log metadata of documents for debugging
for i, doc in enumerate(splits[:3]): # Log first 3 documents as samples
logger.info(f"Document {i} metadata: {doc.metadata}")
# Check for extremely long chunks before embedding
max_token_length = max([len(doc.page_content.split()) for doc in splits])
logger.info(f"Longest chunk is approximately {max_token_length} words")
if max_token_length > 500:
logger.warning(f"Some chunks may still be too long for the embedding model (max: {max_token_length} words)")
This initializes the embedding model and adds critical validation steps to detect potential issues before they cause errors.
try:
# Final check to ensure Qdrant is still running
if not check_qdrant_running():
logger.error("Qdrant server connection lost. Please ensure the server is running properly.")
return False
# Create the vector store
logger.info(f"Creating vector store at {qdrant_url}")
vectorstore = QdrantVectorStore.from_documents(
documents=splits,
embedding=embedding,
url=qdrant_url,
collection_name="rag",
force_recreate=True, # Force recreate the collection if it exists
)
logger.info(f"Successfully created vector store with {len(splits)} chunks")
return True
except Exception as e:
logger.error(f"Error creating vector store: {str(e)}")
logger.error(f"""
Failed to connect to Qdrant at {qdrant_url}.
Troubleshooting steps:
1. Check if Qdrant is running:
- For Docker: run 'docker ps' to see if the container is running
- For local installation: check if the process is running
2. Verify the URL in your .env file:
- It should contain QDRANT_URL_LOCALHOST=http://localhost:6333
3. Make sure ports 6333 and 6334 are not blocked by a firewall
4. Try running Qdrant manually:
- Docker: docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
- Local: follow instructions at https://qdrant.tech/documentation/quick-start/
""")
return False
This wraps the vector store creation in a try/except block with comprehensive error handling and detailed troubleshooting instructions. It also passes force_recreate=True to ensure the collection is cleanly recreated if it already exists.
The script includes a more robust main function that coordinates the entire process:
def main():
"""Main function to run the complete RAG setup"""
logger.info("Starting RAG setup process...")
# Step 1: Set up Ollama model
logger.info("STEP 1: Setting up Ollama model...")
if not setup_ollama_model():
logger.error("Failed to set up Ollama model. Exiting.")
return
# Step 2: Update application model reference if needed
logger.info("STEP 2: Checking application model reference...")
updated_file = update_application_model()
if updated_file:
logger.info(f"Successfully updated model reference in {updated_file}")
else:
logger.warning("No automatic model update performed. Please check your application file manually.")
# Step 3: Create vector database
logger.info("STEP 3: Creating vector database...")
db_success = create_vector_database()
# Display appropriate success/error message based on results
if db_success:
logger.info("Vector database created successfully!")
# Display success message with next steps
else:
logger.error("PARTIAL SETUP COMPLETED")
# Display partial success message with troubleshooting guidance
if __name__ == "__main__":
main()
This main function provides a clear workflow with distinct steps, proper error handling, and informative output at each stage.
chatbot@Office:~/Workshop--LLM/Playground/chainlit$ uv run setup_rag.py
2025-02-27 15:12:45,677 - INFO - Starting RAG setup process...
2025-02-27 15:12:45,677 - INFO - STEP 1: Setting up Ollama model...
2025-02-27 15:12:45,681 - INFO - Available models: deepseek-llm:latest, phi4:latest
2025-02-27 15:12:45,681 - INFO - DeepSeek model already available
2025-02-27 15:12:45,681 - INFO - STEP 2: Checking application model reference...
2025-02-27 15:12:45,681 - INFO - No incorrect model reference found in ./rag-chainlit-deepseek.py or the file uses a different format.
2025-02-27 15:12:45,681 - WARNING - No automatic model update performed. Please check your application file manually.
2025-02-27 15:12:45,681 - INFO - STEP 3: Creating vector database...
2025-02-27 15:12:45,682 - ERROR - Qdrant server is not running. Attempting to start it...
2025-02-27 15:12:45,716 - INFO - Creating new Qdrant container...
Unable to find image 'qdrant/qdrant:latest' locally
latest: Pulling from qdrant/qdrant
c29f5b76f736: Already exists
fb9c768cb3bb: Pull complete
4f4fb700ef54: Pull complete
7b3a2bcc7760: Pull complete
93f5a85b768c: Pull complete
567ae78ca994: Pull complete
a06ec4f8101e: Pull complete
d197820764df: Pull complete
Digest: sha256:318c11b72aaab96b36e9662ad244de3cabd0653a1b942d4e8191f18296c81af0
Status: Downloaded newer image for qdrant/qdrant:latest
962fbf757fdbd7b2b5035c7382d812a97cb66664a8092777fa4ae2a1b877f154
2025-02-27 15:12:49,636 - INFO - Waiting for Qdrant server to start...
2025-02-27 15:12:51,639 - INFO - Qdrant server is now running
2025-02-27 15:12:51,857 - INFO - Loading document from ./data/DeepSeek_R1.pdf
2025-02-27 15:12:51,874 - INFO - Going to convert document batch...
2025-02-27 15:12:52,260 - INFO - Accelerator device: 'cuda:0'
2025-02-27 15:12:54,305 - INFO - Accelerator device: 'cuda:0'
/home/chatbot/Workshop--LLM/Playground/chainlit/.venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
2025-02-27 15:12:55,151 - INFO - Accelerator device: 'cuda:0'
2025-02-27 15:12:55,548 - INFO - Processing document DeepSeek_R1.pdf
2025-02-27 15:13:08,283 - INFO - Finished converting document DeepSeek_R1.pdf in 16.43 sec.
Token indices sequence length is longer than the specified maximum sequence length for this model (1280 > 512). Running this sequence through the model will result in indexing errors
2025-02-27 15:13:08,638 - INFO - Processing 61 initial chunks from Docling
2025-02-27 15:13:08,638 - INFO - Splitting large chunk of size ~417 words
2025-02-27 15:13:08,640 - INFO - After additional splitting: 63 chunks
2025-02-27 15:13:08,882 - INFO - Use pytorch device_name: cuda
2025-02-27 15:13:08,882 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
2025-02-27 15:13:10,530 - INFO - Document 0 metadata: {'source': './data/DeepSeek_R1.pdf', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/1', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 265.74798583984375, 't': 659.0969848632812, 'r': 329.52899169921875, 'b': 648.7059936523438, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 11]}]}, {'self_ref': '#/texts/2', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 237.2270050048828, 't': 635.4180297851562, 'r': 358.04998779296875, 'b': 626.4650268554688, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 21]}]}], 'headings': ['DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 999333483494855273, 'filename': 'DeepSeek_R1.pdf'}}, 'page': 1}
2025-02-27 15:13:10,530 - INFO - Document 1 metadata: {'source': './data/DeepSeek_R1.pdf', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/4', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 70.31800079345703, 't': 551.405029296875, 'r': 526.3280029296875, 'b': 411.89801025390625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 887]}]}, {'self_ref': '#/texts/5', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'caption', 'prov': [{'page_no': 1, 'bbox': {'l': 173.2050018310547, 't': 118.31200408935547, 'r': 422.0740051269531, 'b': 107.67300415039062, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 48]}]}], 'headings': ['Abstract'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 999333483494855273, 'filename': 'DeepSeek_R1.pdf'}}, 'page': 1}
2025-02-27 15:13:10,530 - INFO - Document 2 metadata: {'source': './data/DeepSeek_R1.pdf', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/tables/0', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'document_index', 'prov': [{'page_no': 2, 'bbox': {'l': 69.50955200195312, 't': 719.6405029296875, 'r': 525.8222045898438, 'b': 217.884033203125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 0]}]}], 'headings': ['Contents'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 999333483494855273, 'filename': 'DeepSeek_R1.pdf'}}, 'page': 1}
2025-02-27 15:13:10,530 - INFO - Longest chunk is approximately 400 words
2025-02-27 15:13:10,533 - INFO - Creating vector store at http://localhost:6333
2025-02-27 15:13:10,602 - INFO - HTTP Request: GET http://localhost:6333/collections/rag/exists "HTTP/1.1 200 OK"
2025-02-27 15:13:10,782 - INFO - HTTP Request: PUT http://localhost:6333/collections/rag "HTTP/1.1 200 OK"
2025-02-27 15:13:11,163 - INFO - HTTP Request: PUT http://localhost:6333/collections/rag/points?wait=true "HTTP/1.1 200 OK"
2025-02-27 15:13:11,164 - INFO - Successfully created vector store with 63 chunks
2025-02-27 15:13:11,165 - INFO - Vector database created successfully!
2025-02-27 15:13:11,165 - INFO -
=====================================================
RAG SETUP COMPLETED SUCCESSFULLY!
What's been done:
1. Checked/pulled the DeepSeek LLM model in Ollama
2. Updated application file model reference (if found)
3. Created vector database with proper metadata
You can now run your Chainlit application:
$ chainlit run rag-chainlit-deepseek.py -p 8501
=====================================================
To process your own documents:
FILE_PATH = "./path/to/your/document.pdf"
For longer or more complex documents:
chunker=HybridChunker(
tokenizer=EMBED_MODEL_ID,
chunk_size=200, # Smaller chunks
chunk_overlap=50, # More overlap for context
split_factor=0.7, # More aggressive splitting
),
For different language models or domains:
EMBED_MODEL_ID = "sentence-transformers/all-mpnet-base-v2" # More powerful, but slower
To use a different language model:
TARGET_MODEL = "llama3:latest" # Or any other model available in Ollama
Check if Qdrant is running: docker ps | grep qdrant
Start Qdrant manually: docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
Verify the URL in .env: QDRANT_URL_LOCALHOST=http://localhost:6333
Check if Ollama is running: ollama list
Start Ollama: ollama serve
Pull models manually: ollama pull deepseek-llm:latest
Check file path and permissions
For large documents, reduce chunk size
For complex documents with tables, consider specialized loaders
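If the Docling pipeline itself is the suspect, one optional sanity check is to load the same file with a simpler loader and inspect the page metadata; this assumes the pypdf package is installed and is only a debugging aid, not part of the workshop scripts:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF page by page; each page Document carries 'source' and 'page' metadata.
pages = PyPDFLoader("./data/DeepSeek_R1.pdf").load()
print(len(pages), pages[0].metadata)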
Start your Chainlit application:
chainlit run rag-chainlit-deepseek.py -p 8501
Ask questions about your document
The application will:
Convert your question to an embedding
Find relevant document chunks
Send both to the LLM
Generate a response based on the document
Monitor performance and refine as needed:
Adjust chunk sizes
Try different embedding models
Experiment with different retrieval settings
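For example, the retrieval settings can be adjusted when the retriever is built; the values below are only a starting point, not a recommendation:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore

embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = QdrantVectorStore.from_existing_collection(
    embedding=embedding,
    collection_name="rag",
    url="http://localhost:6333",
)

# Return more, and more diverse, chunks per question.
retriever = vectorstore.as_retriever(
    search_type="mmr",       # maximal marginal relevance instead of plain similarity
    search_kwargs={"k": 6},  # number of chunks handed to the LLM
)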
This script implements a Retrieval-Augmented Generation (RAG) system using LangChain, Ollama, and Chainlit. The system allows users to ask questions through a chat interface, and the application will:
Retrieve relevant documents from a vector database (Qdrant)
Combine these documents with the user's question in a prompt
Send this prompt to a language model (DeepSeek LLM)
Return the model's response to the user along with the sources of information
# Import standard library modules
import os # For environment variable access and file operations
# Import type hints for better code documentation
from typing import Iterable # For type hinting collections that can be iterated over
# Import LangChain document handling
from langchain_core.documents import Document as LCDocument # Core document class for LangChain
# Import LangChain prompt handling
from langchain.prompts import ChatPromptTemplate # For creating structured prompts for chat models
# Import embedding model from HuggingFace integration
from langchain_huggingface.embeddings import HuggingFaceEmbeddings # For text-to-vector conversions
# Import LangChain runnable components for building the pipeline
from langchain.schema.runnable import Runnable, RunnablePassthrough, RunnableConfig # For creating processing pipelines
from langchain.schema import StrOutputParser # For parsing LLM outputs as strings
# Import callback handling for tracking pipeline operations
from langchain.callbacks.base import BaseCallbackHandler # Base class for creating custom callbacks
# Import Ollama integration for accessing local LLMs
from langchain_ollama import OllamaLLM # For interfacing with locally running Ollama models
# Import Qdrant vector database integration
from langchain_qdrant import QdrantVectorStore # For connecting to Qdrant vector database
# Import Chainlit for building the chat interface
import chainlit as cl # Web-based chat interface framework
# Load environment variables from .env file
from dotenv import load_dotenv # For loading environment variables from a .env file
load_dotenv() # Execute the loading of environment variables
# Get the Qdrant database URL from environment variables
qdrant_url = os.getenv("QDRANT_URL_LOCALHOST") # URL for the local Qdrant instance
# Define the embedding model to use - this converts text to vector embeddings
# all-MiniLM-L6-v2 is a lightweight, efficient embedding model with good performance
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
# Check if the model exists and download if needed
import subprocess
import json
def check_and_download_model(model_name):
try:
# List all models
result = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
model_list = result.stdout.strip().split('\n')
# Skip header row and check if our model exists
model_exists = any(model_name in model_line for model_line in model_list[1:] if model_line)
if not model_exists:
print(f"Model {model_name} not found. Downloading...")
subprocess.run(["ollama", "pull", model_name], check=True)
print(f"Downloaded {model_name} successfully.")
else:
print(f"Model {model_name} is already available.")
except subprocess.CalledProcessError as e:
print(f"Error checking or downloading model: {e}")
raise
# Check and download the model if needed
check_and_download_model("deepseek-llm:latest")
# Initialize the language model using Ollama
# deepseek-llm is the specific model being used for generating responses
llm = OllamaLLM(
model="deepseek-llm:latest" # Using the latest version of the deepseek-llm model
)
# This decorator registers this function to run when a new chat session starts
@cl.on_chat_start
async def on_chat_start():
# Define the prompt template that instructs the LLM how to answer
# {context} will be filled with retrieved documents
# {question} will be filled with the user's query
template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
# Create a prompt object from the template string
prompt = ChatPromptTemplate.from_template(template)
# Helper function to format a list of documents into a single string
# This combines all retrieved document contents with newlines as separators
def format_docs(docs):
return "\n\n".join([d.page_content for d in docs])
# Initialize the embedding model for converting text to vectors
embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)
# Connect to an existing Qdrant collection named "rag"
# This collection should already contain embedded documents
vectorstore = QdrantVectorStore.from_existing_collection(
embedding=embedding, # The embedding model to use for query conversion
collection_name="rag", # The name of the collection in Qdrant
url=qdrant_url # The URL of the Qdrant server
)
# Create a retriever from the vector store
# This will be used to find relevant documents based on query similarity
retriever = vectorstore.as_retriever()
# Build the RAG pipeline using LangChain's runnable interface
# This defines the sequence of operations that will process each user query
runnable = (
# Step 1: Prepare inputs for the prompt template
{
# Retrieve documents and format them into a single context string
"context": retriever | format_docs,
# Pass the user's question through unchanged
"question": RunnablePassthrough()
}
# Step 2: Fill the prompt template with the context and question
| prompt
# Step 3: Send the filled prompt to the language model
| llm
# Step 4: Parse the LLM output as a string
| StrOutputParser()
)
# Store the runnable in the user's session for reuse with each message
cl.user_session.set("runnable", runnable)
# This decorator registers this function to handle each new user message
@cl.on_message
async def on_message(message: cl.Message):
# Retrieve the runnable from the user's session
runnable = cl.user_session.get("runnable") # type: Runnable
# Create an empty message that will be populated with the response
# The content will be streamed in chunks as it's generated
msg = cl.Message(content="")
# Define a custom callback handler to track and display document sources
class PostMessageHandler(BaseCallbackHandler):
"""
Callback handler for handling the retriever and LLM processes.
Used to post the sources of the retrieved documents as a Chainlit element.
"""
def __init__(self, msg: cl.Message):
BaseCallbackHandler.__init__(self)
self.msg = msg # Store reference to the message being built
self.sources = set() # Use a set to store unique source-page pairs
# This method is called when document retrieval is complete
def on_retriever_end(self, documents, *, run_id, parent_run_id, **kwargs):
# Extract source and page information from each retrieved document
for d in documents:
source_page_pair = (d.metadata['source'], d.metadata['page'])
self.sources.add(source_page_pair) # Add unique pairs to the set
# This method is called when the LLM finishes generating a response
def on_llm_end(self, response, *, run_id, parent_run_id, **kwargs):
# If we have sources to display, format them and add as an element
if len(self.sources):
# Create a formatted string of sources with page references
sources_text = "\n".join([f"{source}#page={page}" for source, page in self.sources])
# Add the sources as a text element to the message
self.msg.elements.append(
cl.Text(name="Sources", content=sources_text, display="inline")
)
# Stream the response from the runnable, processing the user's message
async for chunk in runnable.astream(
message.content, # Pass the user's message content to the runnable
config=RunnableConfig(callbacks=[
cl.LangchainCallbackHandler(), # Standard Chainlit-LangChain integration
PostMessageHandler(msg) # Our custom handler for tracking sources
]),
):
# Stream each token to the UI as it's generated
await msg.stream_token(chunk)
# Send the complete message once streaming is finished
await msg.send()
import os
import subprocess
import json
from typing import Iterable
os: Used to access environment variables (e.g., os.getenv("QDRANT_URL_LOCALHOST"))
subprocess: Added to execute Ollama commands for checking and downloading models
json: For parsing JSON responses, though not actively used in the current implementation
typing.Iterable: Used for type hinting, indicating collections that can be iterated over
from langchain_core.documents import Document as LCDocument
Provides the core Document class, which is the fundamental unit in LangChain for text content
Each document contains text content (page_content) and metadata (like source and page number)
Renamed to LCDocument to avoid naming conflicts
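For example, a document carrying the metadata this app relies on might look like:
from langchain_core.documents import Document as LCDocument

# Illustrative document: the text plus the metadata used later for source reporting.
doc = LCDocument(
    page_content="DeepSeek-R1 is evaluated on reasoning benchmarks ...",
    metadata={"source": "./data/DeepSeek_R1.pdf", "page": 3},
)
print(doc.page_content)
print(doc.metadata["page"])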
from langchain.prompts import ChatPromptTemplate
ChatPromptTemplate: Creates structured prompts for language models
Allows template variables (like {context} and {question}) to be filled in dynamically
Formats the prompt in a way that the model understands
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
Provides integration with HuggingFace's embedding models
Used to convert text into vector representations (embeddings)
These embeddings capture semantic meaning, enabling similarity-based retrieval
from langchain.schema.runnable import Runnable, RunnablePassthrough, RunnableConfig
from langchain.schema import StrOutputParser
Runnable: Base interface for components that can be executed
RunnablePassthrough: Passes input directly to output without modifications
RunnableConfig: Configuration for runnable components, including callback settings
StrOutputParser: Converts LLM outputs to simple strings
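A tiny illustration of the two simplest pieces (outputs shown as comments):
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema import StrOutputParser

# RunnablePassthrough forwards its input unchanged.
print(RunnablePassthrough().invoke("what did the paper report?"))  # -> "what did the paper report?"

# StrOutputParser reduces an LLM/chat output to plain text; on a plain string it is a no-op.
print(StrOutputParser().invoke("already a string"))                # -> "already a string"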
from langchain.callbacks.base import BaseCallbackHandler
Provides the BaseCallbackHandler class for creating custom callbacks
Callbacks are triggered at different stages of the pipeline execution
Used in this script to track retrieved documents and their sources
from langchain_ollama import OllamaLLM
Provides the OllamaLLM class for connecting to locally running Ollama models
Handles communication with the Ollama API
Manages model loading, inference requests, and response parsing
from langchain_qdrant import QdrantVectorStore
Integrates with Qdrant, a vector similarity search database
Provides methods to store, retrieve, and search for vector embeddings
Enables semantic search across document collections
import chainlit as cl
Web-based chat interface framework
Provides decorators like @cl.on_chat_start and @cl.on_message to define app behavior
Handles streaming, UI elements, and user session management
from dotenv import load_dotenv
load_dotenv()
qdrant_url = os.getenv("QDRANT_URL_LOCALHOST")
load_dotenv(): Loads environment variables from a .env file into the environment
The .env file should contain QDRANT_URL_LOCALHOST=http://localhost:6333 or similar
This separation of configuration allows the script to be used in different environments without code changes
def check_and_download_model(model_name):
try:
# List all models
result = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
model_list = result.stdout.strip().split('\n')
# Skip header row and check if our model exists
model_exists = any(model_name in model_line for model_line in model_list[1:] if model_line)
if not model_exists:
print(f"Model {model_name} not found. Downloading...")
subprocess.run(["ollama", "pull", model_name], check=True)
print(f"Downloaded {model_name} successfully.")
else:
print(f"Model {model_name} is already available.")
except subprocess.CalledProcessError as e:
print(f"Error checking or downloading model: {e}")
raise
check_and_download_model("deepseek-llm:latest")
subprocess.run(): Executes shell commands and captures their output
["ollama", "list"]: Lists all available Ollama models
The output is parsed to check if our model exists, skipping the header row
If the model isn't found, subprocess.run(["ollama", "pull", model_name]) downloads it
Proper error handling with try/except blocks captures and reports any issues
The function is called with "deepseek-llm:latest" before initializing the model
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
llm = OllamaLLM(
model="deepseek-llm:latest"
)
EMBED_MODEL_ID: Specifies the HuggingFace model to use for embeddings
"sentence-transformers/all-MiniLM-L6-v2" is chosen for its efficiency and performance
It balances quality and speed, making it good for production systems
OllamaLLM: Initializes the language model using Ollama
"deepseek-llm:latest" is the DeepSeek LLM model for response generation
"latest" ensures using the most recent version available
@cl.on_chat_start
async def on_chat_start():
@cl.on_chat_start: A decorator that registers this function to run once when a new chat session starts
async: Indicates this is an asynchronous function, allowing non-blocking operations
Prompt Template Definition
template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
Defines a string template with placeholders for context and question
The template instructs the LLM to answer based only on the provided context
ChatPromptTemplate.from_template() converts this string into a prompt object
Document Formatting
def format_docs(docs):
return "\n\n".join([d.page_content for d in docs])
A helper function that takes a list of document objects
Extracts the page_content from each document
Joins them with double newlines to create a single context string
Embedding Model Initialization
embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)
Initializes the embedding model using the specified model ID
This model converts text to vector embeddings for similarity search
Vector Store Connection
vectorstore = QdrantVectorStore.from_existing_collection(
embedding=embedding,
collection_name="rag",
url=qdrant_url
)
Connects to an existing Qdrant collection named "rag"
Uses the initialized embedding model for query conversion
The collection should already contain embedded documents
url=qdrant_url specifies the Qdrant server location
Retriever Creation
retriever = vectorstore.as_retriever()
Creates a retriever object from the vector store
By default, this retriever will return the top-k most similar documents
Can be configured with parameters like search_kwargs={"k": 4} to specify the number of documents
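For example, continuing from the vectorstore created above (the value of k is illustrative):
# Ask the retriever for exactly four chunks per query.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
docs = retriever.invoke("Which benchmarks are reported for DeepSeek-R1?")
print(len(docs))  # -> 4 (or fewer if the collection is small)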
RAG Pipeline Construction
runnable = (
{
"context": retriever | format_docs,
"question": RunnablePassthrough()
}
| prompt
| llm
| StrOutputParser()
)
Defines a pipeline using LangChain's runnable interface:
retriever | format_docs: Retrieves relevant documents and formats them into a context string
RunnablePassthrough(): Passes the user's question through unchanged
| prompt: Fills the prompt template with the context and question
| llm: Sends the filled prompt to the language model
| StrOutputParser(): Parses the LLM output as a simple string
The pipe operator (|) chains these operations together
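For a one-off test outside Chainlit, the same pipeline can be invoked synchronously (the question is illustrative):
# Run the full retrieve -> prompt -> LLM -> parse chain once and print the answer.
answer = runnable.invoke("What does DeepSeek-R1 report on reasoning benchmarks?")
print(answer)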
Session Storage
cl.user_session.set("runnable", runnable)
Stores the runnable pipeline in the user's session
Makes it available for reuse with each message in the conversation
Avoids recreating the pipeline for every message
@cl.on_message
async def on_message(message: cl.Message):
@cl.on_message: A decorator that registers this function to handle each new user message
Takes a cl.Message object containing the user's input
Runnable Retrieval
runnable = cl.user_session.get("runnable") # type: Runnable
Retrieves the previously stored runnable from the user's session
Type hint indicates it's a Runnable object
Message Creation
msg = cl.Message(content="")
Creates an empty message that will be populated with the response
The content will be streamed in chunks as it's generated
Custom Callback Handler
class PostMessageHandler(BaseCallbackHandler):
def __init__(self, msg: cl.Message):
BaseCallbackHandler.__init__(self)
self.msg = msg
self.sources = set()
def on_retriever_end(self, documents, *, run_id, parent_run_id, **kwargs):
for d in documents:
source_page_pair = (d.metadata['source'], d.metadata['page'])
self.sources.add(source_page_pair)
def on_llm_end(self, response, *, run_id, parent_run_id, **kwargs):
if len(self.sources):
sources_text = "\n".join([f"{source}#page={page}" for source, page in self.sources])
self.msg.elements.append(
cl.Text(name="Sources", content=sources_text, display="inline")
)
Defines a custom callback handler that extends BaseCallbackHandler
Tracks the sources of retrieved documents
Has two main methods:
on_retriever_end: Called when document retrieval is complete
Extracts source and page information from each document's metadata
Adds unique source-page pairs to a set to avoid duplicates
on_llm_end: Called when the LLM finishes generating a response
Formats the collected sources into a text string
Adds this as a text element to the message
Response Streaming
async for chunk in runnable.astream(
message.content,
config=RunnableConfig(callbacks=[
cl.LangchainCallbackHandler(),
PostMessageHandler(msg)
]),
):
await msg.stream_token(chunk)
await msg.send()
runnable.astream(): Processes the user's message asynchronously, streaming the results
message.content: The text of the user's message
RunnableConfig(callbacks=[...]): Configures callbacks for the execution:
cl.LangchainCallbackHandler(): Standard Chainlit-LangChain integration
PostMessageHandler(msg): Our custom handler for tracking sources
await msg.stream_token(chunk): Streams each token to the UI as it's generated
await msg.send(): Sends the complete message once streaming is finished
If you've set up a deployment locally with Qdrant, navigate to http://localhost:6333/dashboard. If you've set up a deployment in a cloud cluster, find your Cluster URL in your cloud dashboard and add :6333/dashboard to the end of the URL. From the dashboard you can list and search existing collections.