
RAG + ChainLit

RAG in action ..


Follow the WSL + Docker instructions to configure the environment.

Ensure Ollama is installed and configured.

  1. Clone the repo and navigate to the folder.

git clone https://github.com/jporeilly/Workshop--LLM.git
cd Workshop--LLM/Playground/chainlit
ls
  2. Ensure uv is installed.

  3. Check the uv version.

uv --version
uv 0.6.3
  4. Pull the Qdrant Docker image and deploy the container.

    docker pull qdrant/qdrant
    docker run --name qdrant -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant
  5. Rename .env.example to .env and set the Qdrant URL:

QDRANT_URL_LOCALHOST="http://localhost:6333"

  6. Install the required packages - this creates the virtual environment.

    uv sync

Qdrant Web UI

You can manage both local and cloud Qdrant deployments through the Web UI.

Access the Web UI

Qdrant’s Web UI is an intuitive and efficient graphic interface for your Qdrant Collections, REST API and data points.

If you’ve set up a deployment locally with the Qdrant Quickstart, navigate to http://localhost:6333/dashboard.

If you’ve set up a deployment in a cloud cluster, find your Cluster URL in your cloud dashboard at https://cloud.qdrant.io and add :6333/dashboard to the end of the URL.

In the Console, you can use the REST API to interact with Qdrant, while in Collections you can manage all the collections and upload Snapshots.

Qdrant Web UI features

In the Qdrant Web UI, you can:

  • Run HTTP-based calls from the console

  • List and search existing collections
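
For example, once the setup script below has created the rag collection, you can run a call such as GET collections/rag in the console, or hit the same endpoints from Python (a minimal sketch assuming the default local URL):

import requests

# List all collections on the local Qdrant instance (the same endpoint the setup script polls)
print(requests.get("http://localhost:6333/collections").json())

# Inspect the "rag" collection created by the setup script (status, vector count, config)
print(requests.get("http://localhost:6333/collections/rag").json())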


ChainLit

  1. Run the setup script, then launch the Chainlit app.

uv run setup-rag.py
uv run chainlit run rag-chainlit-deepseek.py -p 8501
  2. Enter a question, for example:

What is the score of OpenAI o1-mini and DeepSeek-R1-Zero on reasoning-related benchmarks?

RAG System Setup Script

What is this Script?

This script automates the setup of a Retrieval Augmented Generation (RAG) system, which combines the power of large language models with document retrieval. The script performs three key tasks:

Document Processing Pipeline

  • Document Loading: Using DoclingLoader to read PDF documents

  • Document Chunking: Breaking documents into smaller, manageable pieces

  • Embedding Generation: Converting text chunks into vector embeddings

  • Metadata Handling: Ensuring all chunks have proper metadata for retrieval

Vector Database (Qdrant)

  • A specialized database that stores text and their vector representations

  • Enables semantic search (finding similar concepts, not just keyword matching)

  • Used during inference to retrieve relevant document chunks

Language Model (DeepSeek)

  • The LLM that will generate responses based on:

    • The user's question

    • The retrieved document chunks from the vector database

Script Architecture and Flow

  1. Environment and Dependency Setup

    • Loads configuration from .env file

    • Sets up logging

    • Defines model and file paths

  2. Ollama LLM Setup

    • Checks if Ollama server is running

    • Lists available models

    • Downloads the DeepSeek LLM if not already available

  3. Application Update

    • Finds the application file

    • Updates model references to use the correct model name

  4. Qdrant Database Setup

    • Checks if Qdrant server is running

    • Attempts to start Qdrant if needed (using Docker)

    • Provides detailed troubleshooting if connection fails

  5. Document Processing

    • Loads and chunks the document

    • Ensures all chunks have proper metadata

    • Adds a 'page' field to prevent KeyError('page') during retrieval

  6. Vector Database Creation

    • Converts all chunks to embeddings

    • Stores the vectors in Qdrant

    • Associates each vector with its document text and metadata


Error Handling and Resilience

The script includes robust error handling:

  • Checks for prerequisites before proceeding

  • Attempts automated fixes for common issues

  • Provides detailed error messages with troubleshooting steps

  • Continues with partial completion if some steps succeed

import os
import sys
import logging
import subprocess
import requests
import time
from typing import Iterator, List, Optional
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_community.document_loaders import DirectoryLoader
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document as LCDocument
from langchain_docling import DoclingLoader
from docling.chunking import HybridChunker
from langchain_docling.loader import ExportType

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()

# Get Qdrant URL from environment variables
qdrant_url = os.getenv("QDRANT_URL_LOCALHOST", "http://localhost:6333")

# Define model for embeddings - using a smaller, efficient sentence transformer model
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"

# Set the export type - DOC_CHUNKS means documents will be exported as chunked pieces
EXPORT_TYPE = ExportType.DOC_CHUNKS

# Path to the PDF file that will be processed
FILE_PATH = "./data/DeepSeek_R1.pdf"  

# Target Ollama model
TARGET_MODEL = "deepseek-llm:latest"

def check_ollama_running() -> bool:
    """Check if Ollama server is running"""
    try:
        response = requests.get("http://localhost:11434/api/tags", timeout=5)
        return response.status_code == 200
    except requests.exceptions.ConnectionError:
        return False
    except Exception as e:
        logger.error(f"Error checking Ollama server: {str(e)}")
        return False

def check_qdrant_running() -> bool:
    """Check if Qdrant server is running"""
    try:
        response = requests.get(f"{qdrant_url}/collections", timeout=5)
        return response.status_code == 200
    except requests.exceptions.ConnectionError:
        return False
    except Exception as e:
        logger.error(f"Error checking Qdrant server: {str(e)}")
        return False

def start_qdrant_container():
    """Attempt to start a Qdrant container if it's not running"""
    try:
        # Check if Docker is available
        docker_check = subprocess.run(["docker", "--version"], 
                                     stdout=subprocess.PIPE, 
                                     stderr=subprocess.PIPE,
                                     text=True)
        
        if docker_check.returncode != 0:
            logger.error("Docker is not available. Please install Docker or start Qdrant manually.")
            return False
        
        # Check if qdrant container exists
        container_check = subprocess.run(["docker", "ps", "-a", "--filter", "name=qdrant", "--format", "{{.Names}}"],
                                       stdout=subprocess.PIPE,
                                       stderr=subprocess.PIPE,
                                       text=True)
        
        container_exists = "qdrant" in container_check.stdout
        
        if container_exists:
            # Start existing container
            logger.info("Found existing Qdrant container. Attempting to start it...")
            subprocess.run(["docker", "start", "qdrant"])
        else:
            # Create and start new container
            logger.info("Creating new Qdrant container...")
            subprocess.run([
                "docker", "run", "-d",
                "--name", "qdrant",
                "-p", "6333:6333",
                "-p", "6334:6334",
                "-v", "qdrant_storage:/qdrant/storage",
                "qdrant/qdrant"
            ])
        
        # Wait for container to start
        for _ in range(5):
            if check_qdrant_running():
                logger.info("Qdrant server is now running")
                return True
            logger.info("Waiting for Qdrant server to start...")
            time.sleep(2)
        
        logger.error("Qdrant server failed to start within the expected time")
        return False
        
    except Exception as e:
        logger.error(f"Error starting Qdrant container: {str(e)}")
        return False

def list_available_models() -> List[str]:
    """List all available models in Ollama"""
    try:
        response = requests.get("http://localhost:11434/api/tags")
        if response.status_code == 200:
            models = response.json().get("models", [])
            return [model.get("name") for model in models]
        return []
    except requests.exceptions.ConnectionError:
        return []
    except Exception as e:
        logger.error(f"Error listing models: {str(e)}")
        return []

def pull_model(model_name: str) -> bool:
    """Pull a model from Ollama"""
    logger.info(f"Pulling model: {model_name}")
    try:
        # Using subprocess to show progress in real-time
        process = subprocess.Popen(
            ["ollama", "pull", model_name],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True
        )
        
        # Print output in real-time
        for line in process.stdout:
            print(line, end='')
            sys.stdout.flush()
        
        process.wait()
        return process.returncode == 0
    except Exception as e:
        logger.error(f"Error pulling model: {str(e)}")
        return False

def setup_ollama_model() -> bool:
    """Ensure the required Ollama model is available"""
    # Check if Ollama is running
    if not check_ollama_running():
        logger.error("Ollama server is not running. Please start it first.")
        return False
    
    # List available models
    models = list_available_models()
    logger.info(f"Available models: {', '.join(models) if models else 'None'}")
    
    # Check for DeepSeek model
    if any(model.startswith("deepseek") for model in models):
        logger.info(f"DeepSeek model already available")
        return True
    else:
        logger.info(f"DeepSeek model not found. Will pull {TARGET_MODEL}")
        success = pull_model(TARGET_MODEL)
        if success:
            logger.info(f"Successfully pulled {TARGET_MODEL}")
            return True
        else:
            logger.error(f"Failed to pull {TARGET_MODEL}")
            return False

def update_application_model() -> Optional[str]:
    """
    Check for the application file and update the model reference if needed.
    Returns the file path if updated successfully, None otherwise.
    """
    app_file = "./rag-chainlit-deepseek.py"
    
    if not os.path.exists(app_file):
        logger.warning(f"Application file {app_file} not found. You will need to manually update your model reference.")
        return None
    
    try:
        with open(app_file, 'r', encoding='utf-8') as f:
            content = f.read()
        
        # Look for the Ollama initialization pattern
        if "deepseek-r1:latest" in content:
            # Replace the incorrect model name with the correct one
            updated_content = content.replace("deepseek-r1:latest", TARGET_MODEL)
            
            # Write the updated content back
            with open(app_file, 'w', encoding='utf-8') as f:
                f.write(updated_content)
            
            logger.info(f"Updated model reference in {app_file} from 'deepseek-r1:latest' to '{TARGET_MODEL}'")
            return app_file
        else:
            logger.info(f"No incorrect model reference found in {app_file} or the file uses a different format.")
            return None
    except Exception as e:
        logger.error(f"Error updating application file: {str(e)}")
        return None

def create_vector_database():
    """
    Creates a vector database from a PDF document using Docling for document processing
    and Qdrant for vector storage.
    
    The function:
    1. Loads and chunks the document using Docling
    2. Processes chunks based on export type
    3. Saves the content to a markdown file
    4. Creates embeddings using HuggingFace
    5. Stores the embeddings in a Qdrant vector database
    """
    
    # Check if Qdrant is running
    if not check_qdrant_running():
        logger.error("Qdrant server is not running. Attempting to start it...")
        if not start_qdrant_container():
            logger.error("""
            Failed to connect to Qdrant server. 
            
            Please ensure Qdrant is installed and running:
            
            To install and run Qdrant with Docker:
              docker run -d -p 6333:6333 -p 6334:6334 -v qdrant_storage:/qdrant/storage qdrant/qdrant
            
            Or run Qdrant locally following instructions at:
              https://qdrant.tech/documentation/quick-start/
            """)
            return False
    
    # Initialize DoclingLoader with specified parameters
    loader = DoclingLoader(
        file_path=FILE_PATH,
        export_type=EXPORT_TYPE,
        chunker=HybridChunker(
            tokenizer=EMBED_MODEL_ID,
            chunk_size=300,  # Reduced chunk size
            chunk_overlap=30,  # Some overlap to maintain context between chunks
            split_factor=0.5,  # More aggressive splitting
        ),
    )
    
    logger.info(f"Loading document from {FILE_PATH}")
    
    # Load and process the document
    docling_documents = loader.load()
    
    # Process the documents based on the export type
    if EXPORT_TYPE == ExportType.DOC_CHUNKS:
        # Create a text splitter with a smaller chunk size
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=400,  # Measured in words (see length_function below), a safe size for the embedding model
            chunk_overlap=50,
            length_function=lambda text: len(text.split()),  # Approximate token count using word count
            separators=["\n\n", "\n", " ", ""]
        )
        
        # Further split any chunks that might be too large
        logger.info(f"Processing {len(docling_documents)} initial chunks from Docling")
        splits = []
        for doc in docling_documents:
            # Ensure metadata has a 'page' field to avoid KeyError('page')
            if 'page' not in doc.metadata:
                # Extract page number from source if available, or default to 1
                page_num = doc.metadata.get('source', '').split('_')[-1].split('.')[0] if 'source' in doc.metadata else '1'
                try:
                    doc.metadata['page'] = int(page_num)
                except ValueError:
                    doc.metadata['page'] = 1
            
            # Check if this chunk is potentially too large
            if len(doc.page_content.split()) > 400:  # If chunk has > 400 words, further split it
                logger.info(f"Splitting large chunk of size ~{len(doc.page_content.split())} words")
                smaller_chunks = text_splitter.split_text(doc.page_content)
                # Convert the text chunks back to LangChain Documents with metadata preserved
                splits.extend([
                    LCDocument(page_content=chunk, metadata=doc.metadata) 
                    for chunk in smaller_chunks
                ])
            else:
                splits.append(doc)
        
        logger.info(f"After additional splitting: {len(splits)} chunks")
        
    elif EXPORT_TYPE == ExportType.MARKDOWN:
        # Split based on markdown headers
        splitter = MarkdownHeaderTextSplitter(
            headers_to_split_on=[
                ("#", "Header_1"),
                ("##", "Header_2"),
                ("###", "Header_3"),
            ],
        )
        initial_splits = [split for doc in docling_documents for split in splitter.split_text(doc.page_content)]
        
        # Further chunking
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=400,
            chunk_overlap=50,
            length_function=lambda text: len(text.split()),
            separators=["\n\n", "\n", " ", ""]
        )
        
        splits = []
        for doc in initial_splits:
            # Ensure metadata has a 'page' field
            if 'page' not in doc.metadata:
                doc.metadata['page'] = 1  # Default page value
                
            if len(doc.page_content.split()) > 400:
                smaller_chunks = text_splitter.split_text(doc.page_content)
                splits.extend([
                    LCDocument(page_content=chunk, metadata=doc.metadata) 
                    for chunk in smaller_chunks
                ])
            else:
                splits.append(doc)
                
        logger.info(f"After markdown splitting and additional chunking: {len(splits)} chunks")
    else:
        # Raise an error for unsupported export types
        raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
    
    
    # Save the processed document to a markdown file
    with open('data/output_docling.md', 'a', encoding='utf-8') as f:  # utf-8 encoding for unicode support
        for doc in docling_documents:
            f.write(doc.page_content + '\n')
    
    
    # Initialize the embedding model from HuggingFace
    embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)
    
    # Log metadata of documents for debugging
    for i, doc in enumerate(splits[:3]):  # Log first 3 documents as samples
        logger.info(f"Document {i} metadata: {doc.metadata}")
    
    # Check for extremely long chunks before embedding
    max_token_length = max([len(doc.page_content.split()) for doc in splits])
    logger.info(f"Longest chunk is approximately {max_token_length} words")
    
    if max_token_length > 500:
        logger.warning(f"Some chunks may still be too long for the embedding model (max: {max_token_length} words)")
    
    # Create a Qdrant vector store from the document chunks
    try:
        # Final check to ensure Qdrant is still running
        if not check_qdrant_running():
            logger.error("Qdrant server connection lost. Please ensure the server is running properly.")
            return False
            
        # Create the vector store
        logger.info(f"Creating vector store at {qdrant_url}")
        vectorstore = QdrantVectorStore.from_documents(
            documents=splits,
            embedding=embedding,
            url=qdrant_url,
            collection_name="rag",
            force_recreate=True,  # Force recreate the collection if it exists
        )
        logger.info(f"Successfully created vector store with {len(splits)} chunks")
        return True
    except Exception as e:
        logger.error(f"Error creating vector store: {str(e)}")
        logger.error(f"""
        Failed to connect to Qdrant at {qdrant_url}.
        
        Troubleshooting steps:
        1. Check if Qdrant is running: 
           - For Docker: run 'docker ps' to see if the container is running
           - For local installation: check if the process is running
        
        2. Verify the URL in your .env file:
           - It should contain QDRANT_URL_LOCALHOST=http://localhost:6333
        
        3. Make sure ports 6333 and 6334 are not blocked by a firewall
        
        4. Try running Qdrant manually:
           - Docker: docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
           - Local: follow instructions at https://qdrant.tech/documentation/quick-start/
        """)
        return False

def save_documents_to_pickle(documents, output_file="data/processed_documents.pkl"):
    """Save processed documents to a pickle file as a fallback"""
    import pickle
    try:
        with open(output_file, 'wb') as f:
            pickle.dump(documents, f)
        logger.info(f"Saved processed documents to {output_file} as a fallback")
        return True
    except Exception as e:
        logger.error(f"Error saving documents to pickle: {str(e)}")
        return False

def main():
    """Main function to run the complete RAG setup"""
    logger.info("Starting RAG setup process...")
    
    # Step 1: Set up Ollama model
    logger.info("STEP 1: Setting up Ollama model...")
    if not setup_ollama_model():
        logger.error("Failed to set up Ollama model. Exiting.")
        return
    
    # Step 2: Update application model reference if needed
    logger.info("STEP 2: Checking application model reference...")
    updated_file = update_application_model()
    if updated_file:
        logger.info(f"Successfully updated model reference in {updated_file}")
    else:
        logger.warning("No automatic model update performed. Please check your application file manually.")
    
    # Step 3: Create vector database
    logger.info("STEP 3: Creating vector database...")
    db_success = create_vector_database()
    
    if db_success:
        logger.info("Vector database created successfully!")
        # Final step: Display success message
        logger.info("""
        =====================================================
        RAG SETUP COMPLETED SUCCESSFULLY!
        
        What's been done:
        1. Checked/pulled the DeepSeek LLM model in Ollama
        2. Updated application file model reference (if found)
        3. Created vector database with proper metadata
        
        You can now run your Chainlit application:
        $ chainlit run rag-chainlit-deepseek.py
        =====================================================
        """)
    else:
        logger.error("""
        =====================================================
        PARTIAL SETUP COMPLETED
        
        What's been done:
        1. Checked/pulled the DeepSeek LLM model in Ollama ✓
        2. Updated application file model reference ✓
        3. Failed to create vector database ✗
        
        Please troubleshoot your Qdrant database connection
        before running your Chainlit application.
        =====================================================
        """)

if __name__ == "__main__":
    main()

Technical Details

Document Chunking

Documents are split into smaller pieces to:

  • Fit within embedding model context windows

  • Allow for more granular retrieval

  • Enable more focused responses

The script employs two levels of chunking:

  1. Initial chunking with HybridChunker from Docling

  2. Secondary chunking with RecursiveCharacterTextSplitter for any chunks that are still too large

Embeddings

Text embeddings are numerical representations of text that capture semantic meaning. Our script uses Sentence Transformers (all-MiniLM-L6-v2) which:

  • Creates 384-dimensional vectors

  • Positions semantically similar text closer together in vector space

  • Enables "meaning-based" search rather than just keyword matching
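
A quick way to see both properties, using the same all-MiniLM-L6-v2 model the script loads (the example sentences are made up):

from langchain_huggingface.embeddings import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Each text becomes a 384-dimensional vector
vec_a = embedding.embed_query("DeepSeek-R1 results on reasoning benchmarks")
vec_b = embedding.embed_query("How well does the model score on math and logic tests?")
vec_c = embedding.embed_query("Instructions for assembling a bookshelf")
print(len(vec_a))  # 384

# Cosine similarity: related sentences score noticeably higher than unrelated ones
def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(x * x for x in v) ** 0.5
    return dot / (norm_u * norm_v)

print(cosine(vec_a, vec_b), cosine(vec_a, vec_c))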

Vector Search

When a user asks a question:

  1. The question is converted to an embedding

  2. The vector database finds chunks with similar embeddings

  3. The most relevant chunks are sent to the LLM along with the question

  4. The LLM generates a response based on the question and retrieved information
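
A minimal sketch of this flow outside Chainlit, reusing the same components as the scripts on this page (it assumes the rag collection already exists in Qdrant and that deepseek-llm:latest has been pulled in Ollama; the question is illustrative):

from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_ollama import OllamaLLM

embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = QdrantVectorStore.from_existing_collection(
    embedding=embedding,
    collection_name="rag",
    url="http://localhost:6333",
)

question = "What reinforcement learning approach does DeepSeek-R1 use?"

# Steps 1-2: the question is embedded and the most similar chunks are retrieved
chunks = vectorstore.similarity_search(question, k=4)

# Steps 3-4: the question and the retrieved chunks are sent to the LLM
context = "\n\n".join(doc.page_content for doc in chunks)
llm = OllamaLLM(model="deepseek-llm:latest")
print(llm.invoke(f"Answer the question based only on the following context:\n\n{context}\n\nQuestion: {question}"))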

Standard Libraries:

  • os: For handling environment variables and path operations

  • typing: Provides Iterator type hint

Document Processing:

  • langchain_docling: Integrates Docling with LangChain

  • docling.chunking: Provides the HybridChunker for intelligent document chunking

Embedding and Vector Storage:

  • langchain_huggingface.embeddings: For generating document embeddings

  • langchain_qdrant: Connects to Qdrant vector database

Other LangChain Components:

  • Various text splitters and document loaders

Environment Setup:

  • dotenv: For loading environment variables from a .env file

Key Constants

  • EMBED_MODEL_ID: Uses "sentence-transformers/all-MiniLM-L6-v2", which is a lightweight sentence transformer model for creating embeddings

  • EXPORT_TYPE: Set to ExportType.DOC_CHUNKS, meaning the document will be exported as chunked pieces

  • FILE_PATH: Points to a PDF file ("./data/DeepSeek_R1.pdf") that will be processed

Main Function: create_vector_database()

This function handles the entire process of creating a vector database from a PDF document:

1. Connection Verification and Error Handling

# Check if Qdrant is running
if not check_qdrant_running():
    logger.error("Qdrant server is not running. Attempting to start it...")
    if not start_qdrant_container():
        logger.error("""
        Failed to connect to Qdrant server. 
        
        Please ensure Qdrant is installed and running:
        
        To install and run Qdrant with Docker:
          docker run -d -p 6333:6333 -p 6334:6334 -v qdrant_storage:/qdrant/storage qdrant/qdrant
        
        Or run Qdrant locally following instructions at:
          https://qdrant.tech/documentation/quick-start/
        """)
        return False

This begins with a critical connection check to ensure Qdrant is available, with automatic recovery attempts and detailed error instructions if it fails.

2. Document Loading and Chunking with Optimized Parameters

loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=EXPORT_TYPE,
    chunker=HybridChunker(
        tokenizer=EMBED_MODEL_ID,
        chunk_size=300,  # Reduced chunk size
        chunk_overlap=30,  # Some overlap to maintain context between chunks
        split_factor=0.5,  # More aggressive splitting
    ),
)

logger.info(f"Loading document from {FILE_PATH}")
docling_documents = loader.load()

This initializes a DoclingLoader with optimized parameters to ensure chunks are small enough for the embedding model's context window. This addresses the "Token indices sequence length is longer than the specified maximum sequence length" warning.

3. Advanced Processing with Secondary Chunking

if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    # Create a text splitter with a smaller chunk size
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,  # Measured in words (see length_function below), a safe size for the embedding model
        chunk_overlap=50,
        length_function=lambda text: len(text.split()),  # Approximate token count using word count
        separators=["\n\n", "\n", " ", ""]
    )
    
    # Further split any chunks that might be too large
    logger.info(f"Processing {len(docling_documents)} initial chunks from Docling")
    splits = []
    for doc in docling_documents:
        # Ensure metadata has a 'page' field to avoid KeyError('page')
        if 'page' not in doc.metadata:
            # Extract page number from source if available, or default to 1
            page_num = doc.metadata.get('source', '').split('_')[-1].split('.')[0] if 'source' in doc.metadata else '1'
            try:
                doc.metadata['page'] = int(page_num)
            except ValueError:
                doc.metadata['page'] = 1
        
        # Check if this chunk is potentially too large
        if len(doc.page_content.split()) > 400:  # If chunk has > 400 words, further split it
            logger.info(f"Splitting large chunk of size ~{len(doc.page_content.split())} words")
            smaller_chunks = text_splitter.split_text(doc.page_content)
            # Convert the text chunks back to LangChain Documents with metadata preserved
            splits.extend([
                LCDocument(page_content=chunk, metadata=doc.metadata) 
                for chunk in smaller_chunks
            ])
        else:
            splits.append(doc)
    
    logger.info(f"After additional splitting: {len(splits)} chunks")

This applies secondary chunking to split any remaining large chunks, addressing the token length limitations. It also ensures each chunk has the critical page metadata field to avoid the KeyError('page') error during retrieval.

4. Saving Processed Document with UTF-8 Encoding

with open('data/output_docling.md', 'a', encoding='utf-8') as f:  # utf-8 encoding for unicode support
    for doc in docling_documents:
        f.write(doc.page_content + '\n')

This saves the processed document chunks to a markdown file with explicit UTF-8 encoding to handle special characters, preventing UnicodeEncodeError issues.

5. Creating Embeddings and Validating Data

# Initialize the embedding model from HuggingFace
embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)

# Log metadata of documents for debugging
for i, doc in enumerate(splits[:3]):  # Log first 3 documents as samples
    logger.info(f"Document {i} metadata: {doc.metadata}")

# Check for extremely long chunks before embedding
max_token_length = max([len(doc.page_content.split()) for doc in splits])
logger.info(f"Longest chunk is approximately {max_token_length} words")

if max_token_length > 500:
    logger.warning(f"Some chunks may still be too long for the embedding model (max: {max_token_length} words)")

This initializes the embedding model and adds critical validation steps to detect potential issues before they cause errors.

6. Creating the Vector Store with Error Handling

try:
    # Final check to ensure Qdrant is still running
    if not check_qdrant_running():
        logger.error("Qdrant server connection lost. Please ensure the server is running properly.")
        return False
        
    # Create the vector store
    logger.info(f"Creating vector store at {qdrant_url}")
    vectorstore = QdrantVectorStore.from_documents(
        documents=splits,
        embedding=embedding,
        url=qdrant_url,
        collection_name="rag",
        force_recreate=True,  # Force recreate the collection if it exists
    )
    logger.info(f"Successfully created vector store with {len(splits)} chunks")
    return True
except Exception as e:
    logger.error(f"Error creating vector store: {str(e)}")
    logger.error(f"""
    Failed to connect to Qdrant at {qdrant_url}.
    
    Troubleshooting steps:
    1. Check if Qdrant is running: 
       - For Docker: run 'docker ps' to see if the container is running
       - For local installation: check if the process is running
    
    2. Verify the URL in your .env file:
       - It should contain QDRANT_URL_LOCALHOST=http://localhost:6333
    
    3. Make sure ports 6333 and 6334 are not blocked by a firewall
    
    4. Try running Qdrant manually:
       - Docker: docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
       - Local: follow instructions at https://qdrant.tech/documentation/quick-start/
    """)
    return False

This wraps the vector store creation in a try-except block with comprehensive error handling and detailed troubleshooting instructions. It also sets force_recreate=True to ensure the collection is cleanly recreated if it already exists.

Script Execution with Main Function

The script's main function coordinates the entire process:

def main():
    """Main function to run the complete RAG setup"""
    logger.info("Starting RAG setup process...")
    
    # Step 1: Set up Ollama model
    logger.info("STEP 1: Setting up Ollama model...")
    if not setup_ollama_model():
        logger.error("Failed to set up Ollama model. Exiting.")
        return
    
    # Step 2: Update application model reference if needed
    logger.info("STEP 2: Checking application model reference...")
    updated_file = update_application_model()
    if updated_file:
        logger.info(f"Successfully updated model reference in {updated_file}")
    else:
        logger.warning("No automatic model update performed. Please check your application file manually.")
    
    # Step 3: Create vector database
    logger.info("STEP 3: Creating vector database...")
    db_success = create_vector_database()
    
    # Display appropriate success/error message based on results
    if db_success:
        logger.info("Vector database created successfully!")
        # Display success message with next steps
    else:
        logger.error("PARTIAL SETUP COMPLETED")
        # Display partial success message with troubleshooting guidance

if __name__ == "__main__":
    main()

This main function provides a clear workflow with distinct steps, proper error handling, and informative output at each stage.

chatbot@Office:~/Workshop--LLM/Playground/chainlit$ uv run setup_rag.py 
2025-02-27 15:12:45,677 - INFO - Starting RAG setup process...
2025-02-27 15:12:45,677 - INFO - STEP 1: Setting up Ollama model...
2025-02-27 15:12:45,681 - INFO - Available models: deepseek-llm:latest, phi4:latest
2025-02-27 15:12:45,681 - INFO - DeepSeek model already available
2025-02-27 15:12:45,681 - INFO - STEP 2: Checking application model reference...
2025-02-27 15:12:45,681 - INFO - No incorrect model reference found in ./rag-chainlit-deepseek.py or the file uses a different format.
2025-02-27 15:12:45,681 - WARNING - No automatic model update performed. Please check your application file manually.
2025-02-27 15:12:45,681 - INFO - STEP 3: Creating vector database...
2025-02-27 15:12:45,682 - ERROR - Qdrant server is not running. Attempting to start it...
2025-02-27 15:12:45,716 - INFO - Creating new Qdrant container...
Unable to find image 'qdrant/qdrant:latest' locally
latest: Pulling from qdrant/qdrant
c29f5b76f736: Already exists 
fb9c768cb3bb: Pull complete 
4f4fb700ef54: Pull complete 
7b3a2bcc7760: Pull complete 
93f5a85b768c: Pull complete 
567ae78ca994: Pull complete 
a06ec4f8101e: Pull complete 
d197820764df: Pull complete 
Digest: sha256:318c11b72aaab96b36e9662ad244de3cabd0653a1b942d4e8191f18296c81af0
Status: Downloaded newer image for qdrant/qdrant:latest
962fbf757fdbd7b2b5035c7382d812a97cb66664a8092777fa4ae2a1b877f154
2025-02-27 15:12:49,636 - INFO - Waiting for Qdrant server to start...
2025-02-27 15:12:51,639 - INFO - Qdrant server is now running
2025-02-27 15:12:51,857 - INFO - Loading document from ./data/DeepSeek_R1.pdf
2025-02-27 15:12:51,874 - INFO - Going to convert document batch...
2025-02-27 15:12:52,260 - INFO - Accelerator device: 'cuda:0'
2025-02-27 15:12:54,305 - INFO - Accelerator device: 'cuda:0'
/home/chatbot/Workshop--LLM/Playground/chainlit/.venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
2025-02-27 15:12:55,151 - INFO - Accelerator device: 'cuda:0'
2025-02-27 15:12:55,548 - INFO - Processing document DeepSeek_R1.pdf
2025-02-27 15:13:08,283 - INFO - Finished converting document DeepSeek_R1.pdf in 16.43 sec.
Token indices sequence length is longer than the specified maximum sequence length for this model (1280 > 512). Running this sequence through the model will result in indexing errors
2025-02-27 15:13:08,638 - INFO - Processing 61 initial chunks from Docling
2025-02-27 15:13:08,638 - INFO - Splitting large chunk of size ~417 words
2025-02-27 15:13:08,640 - INFO - After additional splitting: 63 chunks
2025-02-27 15:13:08,882 - INFO - Use pytorch device_name: cuda
2025-02-27 15:13:08,882 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
2025-02-27 15:13:10,530 - INFO - Document 0 metadata: {'source': './data/DeepSeek_R1.pdf', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/1', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 265.74798583984375, 't': 659.0969848632812, 'r': 329.52899169921875, 'b': 648.7059936523438, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 11]}]}, {'self_ref': '#/texts/2', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 237.2270050048828, 't': 635.4180297851562, 'r': 358.04998779296875, 'b': 626.4650268554688, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 21]}]}], 'headings': ['DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 999333483494855273, 'filename': 'DeepSeek_R1.pdf'}}, 'page': 1}
2025-02-27 15:13:10,530 - INFO - Document 1 metadata: {'source': './data/DeepSeek_R1.pdf', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/4', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 70.31800079345703, 't': 551.405029296875, 'r': 526.3280029296875, 'b': 411.89801025390625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 887]}]}, {'self_ref': '#/texts/5', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'caption', 'prov': [{'page_no': 1, 'bbox': {'l': 173.2050018310547, 't': 118.31200408935547, 'r': 422.0740051269531, 'b': 107.67300415039062, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 48]}]}], 'headings': ['Abstract'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 999333483494855273, 'filename': 'DeepSeek_R1.pdf'}}, 'page': 1}
2025-02-27 15:13:10,530 - INFO - Document 2 metadata: {'source': './data/DeepSeek_R1.pdf', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/tables/0', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'document_index', 'prov': [{'page_no': 2, 'bbox': {'l': 69.50955200195312, 't': 719.6405029296875, 'r': 525.8222045898438, 'b': 217.884033203125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 0]}]}], 'headings': ['Contents'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 999333483494855273, 'filename': 'DeepSeek_R1.pdf'}}, 'page': 1}
2025-02-27 15:13:10,530 - INFO - Longest chunk is approximately 400 words
2025-02-27 15:13:10,533 - INFO - Creating vector store at http://localhost:6333
2025-02-27 15:13:10,602 - INFO - HTTP Request: GET http://localhost:6333/collections/rag/exists "HTTP/1.1 200 OK"
2025-02-27 15:13:10,782 - INFO - HTTP Request: PUT http://localhost:6333/collections/rag "HTTP/1.1 200 OK"
2025-02-27 15:13:11,163 - INFO - HTTP Request: PUT http://localhost:6333/collections/rag/points?wait=true "HTTP/1.1 200 OK"
2025-02-27 15:13:11,164 - INFO - Successfully created vector store with 63 chunks
2025-02-27 15:13:11,165 - INFO - Vector database created successfully!
2025-02-27 15:13:11,165 - INFO - 
        =====================================================
        RAG SETUP COMPLETED SUCCESSFULLY!
        
        What's been done:
        1. Checked/pulled the DeepSeek LLM model in Ollama
        2. Updated application file model reference (if found)
        3. Created vector database with proper metadata
        
        You can now run your Chainlit application:
        $ chainlit run rag-chainlit-deepseek.py -p 8501
        =====================================================

Workshop Exercise: Customizing the RAG System

1. Changing the Document Source

To process your own documents:

FILE_PATH = "./path/to/your/document.pdf"

2. Adjusting Chunking Parameters

For longer or more complex documents:

chunker=HybridChunker(
    tokenizer=EMBED_MODEL_ID,
    chunk_size=200,  # Smaller chunks
    chunk_overlap=50,  # More overlap for context
    split_factor=0.7,  # More aggressive splitting
),

3. Using a Different Embedding Model

For different language models or domains:

EMBED_MODEL_ID = "sentence-transformers/all-mpnet-base-v2"  # More powerful, but slower

4. Changing the LLM

To use a different language model:

TARGET_MODEL = "llama3:latest"  # Or any other model available in Ollama

Troubleshooting Common Issues

1. Qdrant Connection Issues

  • Check if Qdrant is running: docker ps | grep qdrant

  • Start Qdrant manually: docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

  • Verify URL in .env: QDRANT_URL_LOCALHOST=http://localhost:6333

2. Ollama Issues

  • Check if Ollama is running: ollama list

  • Start Ollama: ollama serve

  • Pull models manually: ollama pull deepseek-llm:latest

3. Document Processing Errors

  • Check file path and permissions

  • For large documents, reduce chunk size

  • For complex documents with tables, consider specialized loaders


Next Steps After Running This Script

  1. Start your Chainlit application:

chainlit run rag-chainlit-deepseek.py -p 8501 
  2. Ask questions about your document.

  3. The application will:

    • Convert your question to an embedding

    • Find relevant document chunks

    • Send both to the LLM

    • Generate a response based on the document

  4. Monitor performance and refine as needed:

    • Adjust chunk sizes

    • Try different embedding models

    • Experiment with different retrieval settings

This script implements a Retrieval-Augmented Generation (RAG) system using LangChain, Ollama, and Chainlit. The system allows users to ask questions through a chat interface, and the application will:

  1. Retrieve relevant documents from a vector database (Qdrant)

  2. Combine these documents with the user's question in a prompt

  3. Send this prompt to a language model (DeepSeek LLM)

  4. Return the model's response to the user along with the sources of information

# Import standard library modules
import os  # For environment variable access and file operations

# Import type hints for better code documentation
from typing import Iterable  # For type hinting collections that can be iterated over

# Import LangChain document handling
from langchain_core.documents import Document as LCDocument  # Core document class for LangChain

# Import LangChain prompt handling
from langchain.prompts import ChatPromptTemplate  # For creating structured prompts for chat models

# Import embedding model from HuggingFace integration
from langchain_huggingface.embeddings import HuggingFaceEmbeddings  # For text-to-vector conversions

# Import LangChain runnable components for building the pipeline
from langchain.schema.runnable import Runnable, RunnablePassthrough, RunnableConfig  # For creating processing pipelines
from langchain.schema import StrOutputParser  # For parsing LLM outputs as strings

# Import callback handling for tracking pipeline operations
from langchain.callbacks.base import BaseCallbackHandler  # Base class for creating custom callbacks

# Import Ollama integration for accessing local LLMs
from langchain_ollama import OllamaLLM  # For interfacing with locally running Ollama models

# Import Qdrant vector database integration
from langchain_qdrant import QdrantVectorStore  # For connecting to Qdrant vector database

# Import Chainlit for building the chat interface
import chainlit as cl  # Web-based chat interface framework


# Load environment variables from .env file
from dotenv import load_dotenv  # For loading environment variables from a .env file
load_dotenv()  # Execute the loading of environment variables


# Get the Qdrant database URL from environment variables
qdrant_url = os.getenv("QDRANT_URL_LOCALHOST")  # URL for the local Qdrant instance

# Define the embedding model to use - this converts text to vector embeddings
# all-MiniLM-L6-v2 is a lightweight, efficient embedding model with good performance
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"


# Check if the model exists and download if needed
import subprocess
import json

def check_and_download_model(model_name):
    try:
        # List all models
        result = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
        model_list = result.stdout.strip().split('\n')
        
        # Skip header row and check if our model exists
        model_exists = any(model_name in model_line for model_line in model_list[1:] if model_line)
        
        if not model_exists:
            print(f"Model {model_name} not found. Downloading...")
            subprocess.run(["ollama", "pull", model_name], check=True)
            print(f"Downloaded {model_name} successfully.")
        else:
            print(f"Model {model_name} is already available.")
            
    except subprocess.CalledProcessError as e:
        print(f"Error checking or downloading model: {e}")
        raise

# Check and download the model if needed
check_and_download_model("deepseek-llm:latest")

# Initialize the language model using Ollama
# deepseek-llm is the specific model being used for generating responses
llm = OllamaLLM(
    model="deepseek-llm:latest"  # Using the latest version of the deepseek-llm model
)


# This decorator registers this function to run when a new chat session starts
@cl.on_chat_start
async def on_chat_start():
    # Define the prompt template that instructs the LLM how to answer
    # {context} will be filled with retrieved documents
    # {question} will be filled with the user's query
    template = """Answer the question based only on the following context:

    {context}

    Question: {question}
    """
    # Create a prompt object from the template string
    prompt = ChatPromptTemplate.from_template(template)

    # Helper function to format a list of documents into a single string
    # This combines all retrieved document contents with newlines as separators
    def format_docs(docs):
        return "\n\n".join([d.page_content for d in docs])

    # Initialize the embedding model for converting text to vectors
    embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)
    
    # Connect to an existing Qdrant collection named "rag"
    # This collection should already contain embedded documents
    vectorstore = QdrantVectorStore.from_existing_collection(
        embedding=embedding,  # The embedding model to use for query conversion
        collection_name="rag",  # The name of the collection in Qdrant
        url=qdrant_url  # The URL of the Qdrant server
    )
    
    # Create a retriever from the vector store
    # This will be used to find relevant documents based on query similarity
    retriever = vectorstore.as_retriever()

    # Build the RAG pipeline using LangChain's runnable interface
    # This defines the sequence of operations that will process each user query
    runnable = (
        # Step 1: Prepare inputs for the prompt template
        {
            # Retrieve documents and format them into a single context string
            "context": retriever | format_docs, 
            # Pass the user's question through unchanged
            "question": RunnablePassthrough()
        }
        # Step 2: Fill the prompt template with the context and question
        | prompt
        # Step 3: Send the filled prompt to the language model
        | llm
        # Step 4: Parse the LLM output as a string
        | StrOutputParser()
    )

    # Store the runnable in the user's session for reuse with each message
    cl.user_session.set("runnable", runnable)
    
    
# This decorator registers this function to handle each new user message
@cl.on_message
async def on_message(message: cl.Message):
    # Retrieve the runnable from the user's session
    runnable = cl.user_session.get("runnable")  # type: Runnable
    
    # Create an empty message that will be populated with the response
    # The content will be streamed in chunks as it's generated
    msg = cl.Message(content="")

    # Define a custom callback handler to track and display document sources
    class PostMessageHandler(BaseCallbackHandler):
        """
        Callback handler for handling the retriever and LLM processes.
        Used to post the sources of the retrieved documents as a Chainlit element.
        """

        def __init__(self, msg: cl.Message):
            BaseCallbackHandler.__init__(self)
            self.msg = msg  # Store reference to the message being built
            self.sources = set()  # Use a set to store unique source-page pairs

        # This method is called when document retrieval is complete
        def on_retriever_end(self, documents, *, run_id, parent_run_id, **kwargs):
            # Extract source and page information from each retrieved document
            for d in documents:
                source_page_pair = (d.metadata['source'], d.metadata['page'])
                self.sources.add(source_page_pair)  # Add unique pairs to the set

        # This method is called when the LLM finishes generating a response
        def on_llm_end(self, response, *, run_id, parent_run_id, **kwargs):
            # If we have sources to display, format them and add as an element
            if len(self.sources):
                # Create a formatted string of sources with page references
                sources_text = "\n".join([f"{source}#page={page}" for source, page in self.sources])
                # Add the sources as a text element to the message
                self.msg.elements.append(
                    cl.Text(name="Sources", content=sources_text, display="inline")
                )

    # Stream the response from the runnable, processing the user's message
    async for chunk in runnable.astream(
        message.content,  # Pass the user's message content to the runnable
        config=RunnableConfig(callbacks=[
            cl.LangchainCallbackHandler(),  # Standard Chainlit-LangChain integration
            PostMessageHandler(msg)  # Our custom handler for tracking sources
        ]),
    ):
        # Stream each token to the UI as it's generated
        await msg.stream_token(chunk)

    # Send the complete message once streaming is finished
    await msg.send()

Standard Library Imports

import os
import subprocess
import json
from typing import Iterable
  • os: Used to access environment variables (e.g., os.getenv("QDRANT_URL_LOCALHOST"))

  • subprocess: Added to execute Ollama commands for checking and downloading models

  • json: For parsing JSON responses, though not actively used in the current implementation

  • typing.Iterable: Used for type hinting, indicating collections that can be iterated over

LangChain Document Handling

from langchain_core.documents import Document as LCDocument
  • Provides the core Document class which is the fundamental unit in LangChain for text content

  • Each document contains text content (page_content) and metadata (like source, page number)

  • Renamed to LCDocument to avoid naming conflicts

LangChain Prompt Handling

from langchain.prompts import ChatPromptTemplate
  • ChatPromptTemplate: Creates structured prompts for language models

  • Allows template variables (like {context} and {question}) to be filled in dynamically

  • Formats the prompt in a way that the model understands
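
A small illustration of how the placeholders get filled (the context and question strings here are made up):

from langchain.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the following context:\n\n{context}\n\nQuestion: {question}"
)

# Fill the template variables and inspect the resulting message sent to the model
messages = prompt.format_messages(
    context="DeepSeek-R1 reports strong results on reasoning benchmarks.",
    question="How does DeepSeek-R1 perform on reasoning benchmarks?",
)
print(messages[0].content)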

Embedding Model

from langchain_huggingface.embeddings import HuggingFaceEmbeddings
  • Provides integration with HuggingFace's embedding models

  • Used to convert text into vector representations (embeddings)

  • These embeddings capture semantic meaning, enabling similarity-based retrieval

LangChain Runnable Components

from langchain.schema.runnable import Runnable, RunnablePassthrough, RunnableConfig
from langchain.schema import StrOutputParser
  • Runnable: Base interface for components that can be executed

  • RunnablePassthrough: Passes input directly to output without modifications

  • RunnableConfig: Configuration for runnable components, including callback settings

  • StrOutputParser: Converts LLM outputs to simple strings

Callback Handling

from langchain.callbacks.base import BaseCallbackHandler
  • Provides the BaseCallbackHandler class for creating custom callbacks

  • Callbacks are triggered at different stages of the pipeline execution

  • Used in this script to track retrieved documents and their sources

Ollama Integration

from langchain_ollama import OllamaLLM
  • Provides the OllamaLLM class for connecting to locally running Ollama models

  • Handles communication with the Ollama API

  • Manages model loading, inference requests, and response parsing

Qdrant Vector Database

from langchain_qdrant import QdrantVectorStore
  • Integrates with Qdrant, a vector similarity search database

  • Provides methods to store, retrieve, and search for vector embeddings

  • Enables semantic search across document collections

Chainlit Integration

import chainlit as cl
  • Web-based chat interface framework

  • Provides decorators like @cl.on_chat_start and @cl.on_message to define app behavior

  • Handles streaming, UI elements, and user session management

Environment Setup

from dotenv import load_dotenv
load_dotenv()
qdrant_url = os.getenv("QDRANT_URL_LOCALHOST")
  • load_dotenv(): Loads environment variables from a .env file into the environment

  • The .env file should contain QDRANT_URL_LOCALHOST=http://localhost:6333 or similar

  • This separation of configuration allows the script to be used in different environments without code changes

Model Management

def check_and_download_model(model_name):
    try:
        # List all models
        result = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
        model_list = result.stdout.strip().split('\n')
        
        # Skip header row and check if our model exists
        model_exists = any(model_name in model_line for model_line in model_list[1:] if model_line)
        
        if not model_exists:
            print(f"Model {model_name} not found. Downloading...")
            subprocess.run(["ollama", "pull", model_name], check=True)
            print(f"Downloaded {model_name} successfully.")
        else:
            print(f"Model {model_name} is already available.")
            
    except subprocess.CalledProcessError as e:
        print(f"Error checking or downloading model: {e}")
        raise

check_and_download_model("deepseek-llm:latest")
  • subprocess.run(): Executes shell commands and captures their output

  • ["ollama", "list"]: Lists all available Ollama models

  • The output is parsed to check if our model exists, skipping the header row

  • If the model isn't found, subprocess.run(["ollama", "pull", model_name]) downloads it

  • Proper error handling with try/except blocks captures and reports any issues

  • The function is called with "deepseek-llm:latest" before initializing the model

Model Initialization

EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"

llm = OllamaLLM(
    model="deepseek-llm:latest"
)
  • EMBED_MODEL_ID: Specifies the HuggingFace model to use for embeddings

    • "sentence-transformers/all-MiniLM-L6-v2" is chosen for its efficiency and performance

    • It balances quality and speed, making it good for production systems

  • OllamaLLM: Initializes the language model using Ollama

    • "deepseek-llm:latest" is the DeepSeek LLM model for response generation

    • "latest" ensures using the most recent version available

Application Lifecycle Components

On Chat Start Function

@cl.on_chat_start
async def on_chat_start():
  • @cl.on_chat_start: A decorator that registers this function to run once when a new chat session starts

  • async: Indicates this is an asynchronous function, allowing non-blocking operations

Prompt Template Definition

template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
  • Defines a string template with placeholders for context and question

  • The template instructs the LLM to answer based only on the provided context

  • ChatPromptTemplate.from_template() converts this string into a prompt object

Document Formatting

def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])
  • A helper function that takes a list of document objects

  • Extracts the page_content from each document

  • Joins them with double newlines to create a single context string

Embedding Model Initialization

embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)
  • Initializes the embedding model using the specified model ID

  • This model converts text to vector embeddings for similarity search

Vector Store Connection

vectorstore = QdrantVectorStore.from_existing_collection(
    embedding=embedding,
    collection_name="rag",
    url=qdrant_url
)
  • Connects to an existing Qdrant collection named "rag"

  • Uses the initialized embedding model for query conversion

  • The collection should already contain embedded documents

  • url=qdrant_url specifies the Qdrant server location

Retriever Creation

retriever = vectorstore.as_retriever()
  • Creates a retriever object from the vector store

  • By default, this retriever will return the top-k most similar documents

  • Can be configured with parameters like search_kwargs={"k": 4} to specify the number of documents
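
For example, to always retrieve the top 4 most similar chunks (a one-line variation on the code above):

# Return the 4 most similar chunks for every query instead of the default
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})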

RAG Pipeline Construction

runnable = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)
  • Defines a pipeline using LangChain's runnable interface:

    1. retriever | format_docs: Retrieves relevant documents and formats them into a context string

    2. RunnablePassthrough(): Passes the user's question through unchanged

    3. | prompt: Fills the prompt template with the context and question

    4. | llm: Sends the filled prompt to the language model

    5. | StrOutputParser(): Parses the LLM output as a simple string

  • The pipe operator (|) chains these operations together
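
The same chain can also be called synchronously outside Chainlit, which is handy for quick testing (the question is illustrative):

# One call runs the whole pipeline: retrieve, format, prompt, generate, parse
answer = runnable.invoke("What benchmarks does DeepSeek-R1 report results on?")
print(answer)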

Session Storage

cl.user_session.set("runnable", runnable)
  • Stores the runnable pipeline in the user's session

  • Makes it available for reuse with each message in the conversation

  • Avoids recreating the pipeline for every message

On Message Function

@cl.on_message
async def on_message(message: cl.Message):
  • @cl.on_message: A decorator that registers this function to handle each new user message

  • Takes a cl.Message object containing the user's input

Runnable Retrieval

runnable = cl.user_session.get("runnable")  # type: Runnable
  • Retrieves the previously stored runnable from the user's session

  • Type hint indicates it's a Runnable object

Message Creation

msg = cl.Message(content="")
  • Creates an empty message that will be populated with the response

  • The content will be streamed in chunks as it's generated

Custom Callback Handler

class PostMessageHandler(BaseCallbackHandler):
    def __init__(self, msg: cl.Message):
        BaseCallbackHandler.__init__(self)
        self.msg = msg
        self.sources = set()

    def on_retriever_end(self, documents, *, run_id, parent_run_id, **kwargs):
        for d in documents:
            source_page_pair = (d.metadata['source'], d.metadata['page'])
            self.sources.add(source_page_pair)

    def on_llm_end(self, response, *, run_id, parent_run_id, **kwargs):
        if len(self.sources):
            sources_text = "\n".join([f"{source}#page={page}" for source, page in self.sources])
            self.msg.elements.append(
                cl.Text(name="Sources", content=sources_text, display="inline")
            )
  • Defines a custom callback handler that extends BaseCallbackHandler

  • Tracks the sources of retrieved documents

  • Has two main methods:

    • on_retriever_end: Called when document retrieval is complete

      • Extracts source and page information from each document's metadata

      • Adds unique source-page pairs to a set to avoid duplicates

    • on_llm_end: Called when the LLM finishes generating a response

      • Formats the collected sources into a text string

      • Adds this as a text element to the message

Response Streaming

async for chunk in runnable.astream(
    message.content,
    config=RunnableConfig(callbacks=[
        cl.LangchainCallbackHandler(),
        PostMessageHandler(msg)
    ]),
):
    await msg.stream_token(chunk)

await msg.send()
  • runnable.astream(): Processes the user's message asynchronously, streaming the results

  • message.content: The text of the user's message

  • RunnableConfig(callbacks=[...]): Configures callbacks for the execution:

    • cl.LangchainCallbackHandler(): Standard Chainlit-LangChain integration

    • PostMessageHandler(msg): Our custom handler for tracking sources

  • await msg.stream_token(chunk): Streams each token to the UI as it's generated

  • await msg.send(): Sends the complete message once streaming is finished
