
Knowledge Base


Pt1: KB - Simple RAG

This one is fun .. Building a Pentaho Knowledge Base ..

Each product should have its own Knowledge Base - this one is for Pentaho Data Integration.

To save on costs, the Template: Pt1: Pentaho Knowledge Base uses the mistral:7b model.

You will need to load a model that supports Tools.

You should be familiar with FlowiseAI before tackling this flow, as we're going to focus on just a few key areas and concepts.

Let's start with loading the PDF document into our Qdrant vector database.

Obviously we're going to require:

  • a PDF loader: to upload the PDF

  • a splitter: to chunk the text - Recursive Character Text Splitter

  • an embedding model: to create the vectors - nomic-embed-text

  • a vector database: Qdrant

  1. Drag & drop the Pdf File node.

  2. Click Upload File and navigate to the PDF.

  3. Select the option: One document per file

One document per page: each page of the original PDF will be extracted and treated as its own standalone document.

One document per file: each file will be treated as a separate, complete document rather than combining multiple files into a single document.
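To make the distinction concrete, here's a minimal sketch of the two modes using LangChain's Python PDF loader (Flowise runs the LangChainJS equivalent under the hood; the file name is a placeholder):

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document

loader = PyPDFLoader("pentaho-data-integration.pdf")  # placeholder file name

# "One document per page": the loader returns one Document per PDF page
pages = loader.load()

# "One document per file": merge the pages back into a single Document
whole = Document(page_content="\n".join(p.page_content for p in pages))

print(f"{len(pages)} page documents vs 1 file document of {len(whole.page_content)} chars")
```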

The Recursive Character Text Splitter is a technique used in natural language processing and document handling to break down long text documents into smaller, manageable chunks while preserving context and meaning.

Unlike simple character or token splitters that might cut text at arbitrary points, the Recursive Character Text Splitter works hierarchically. It first attempts to split text along natural boundaries like paragraphs, then sentences, and finally characters if needed. This recursive approach ensures that related content stays together when possible, maintaining semantic coherence within chunks.

This technique is particularly valuable when working with large language models that have context window limitations. By intelligently chunking documents, it allows for processing lengthy texts while preserving the contextual relationships needed for tasks like summarization, question answering, and information retrieval. LangChain implements this splitter to help developers manage document processing pipelines effectively.

  1. Drag & drop a Recursive Character Text Splitter and set the Chunk Size and Chunk Overlap.

  2. Set the Chunk Size: 500 and the Chunk Overlap: 20
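Under the hood this node is LangChain's splitter. A minimal Python sketch with the same settings - the sample text is a stand-in:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Tries paragraph boundaries first, then lines, words, and finally characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # max characters per chunk, as set in the node
    chunk_overlap=20,  # characters shared between neighbouring chunks to keep context
)

text = ("Pentaho Data Integration (PDI) provides ETL capabilities. "
        "Transformations move and transform data; jobs orchestrate them.\n\n") * 20
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; first chunk starts: {chunks[0][:60]!r}")
```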

Embeddings serve as the foundation of modern natural language processing, transforming text into dense vector representations that capture semantic meaning. These numerical representations allow machines to understand relationships between words and concepts, enabling powerful applications like semantic search, clustering, and recommendation systems.

The Nomic-embed-text model offers a versatile embedding solution with configurable settings to balance performance and resource requirements. Users can adjust the dimensionality parameter (typically set between 128 and 768 dimensions), with higher dimensions capturing more nuanced semantic relationships at the cost of increased computational overhead. The model also provides batch size configuration to optimize throughput, with default settings balancing efficiency and memory usage.

Ensure the vector size matches the value set in your vector database - 768.

  1. Drag & drop the Ollama Embeddings.

  2. Set the Base URL: http://localhost:11434

  3. Set the Model Name: nomic-embed-text

  4. Click on Additional Parameters to assign the GPUs.

MMap (Memory Mapping) refers to a technique used to efficiently load and access large embedding files from disk. This is particularly important when working with large-scale embedding models that might not fit into memory.
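As a quick sanity check outside Flowise you can call the same Ollama endpoint directly. A minimal sketch - num_gpu and use_mmap are standard Ollama model options that mirror the node's Additional Parameters, though the values here are illustrative assumptions:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        "model": "nomic-embed-text",
        "prompt": "Pentaho Data Integration transforms and loads data.",
        "options": {
            "num_gpu": 1,      # GPU layers to offload - assumption for illustration
            "use_mmap": True,  # memory-map the model file rather than loading it fully into RAM
        },
    },
    timeout=60,
)
embedding = resp.json()["embedding"]
print(len(embedding))  # 768 - must match the Qdrant collection's vector size
```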

Qdrant is an open-source vector database designed to handle high-dimensional vectors for performant, massive-scale AI applications.

  1. Ensure that the Pdf File node and the Ollama Embeddings node are connected to the Qdrant vector database node.

  2. Connect with the Qdrant API credential that's been set in the Credentials section - the API key can be anything as it's a local instance.

  3. When pointing to the Qdrant server, use the container name in the URL, e.g. http://qdrant:6333 (Qdrant's default port).

  4. Set the Qdrant Collection Name: Pentaho Data Integration.

  5. Ensure the Qdrant vector size = embedding model vector size (in this case 768).
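If you'd rather create or verify the collection yourself, here's a minimal sketch with the qdrant-client Python package (the cosine distance metric is an assumption on my part - Flowise will also create the collection for you on upsert):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")  # use the container name if running in Docker

# Vector size must match the embedding model's output: 768 for nomic-embed-text
client.create_collection(
    collection_name="Pentaho Data Integration",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
```

Note that create_collection errors if the collection already exists.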

  1. Check all connections and make sure the Flow is saved ..

  2. You're ready to upload .. Click on the green database icon in the top right.

  3. You can expand each Node to check the settings.

  4. Click on Upsert.

  5. Once the process has completed, it will display the first 20 chunks and indicate the number of records added for that document.

  6. In Qdrant you can also view the Pentaho Data Integration collection.

To check the collection .. just click on its name, which displays a bunch of options:

  • view each record

  • display collection stats

  • check the search quality

  • take snapshots

  • visualize & graph results
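The same information is available programmatically - a minimal sketch, assuming the local qdrant-client setup from above:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Collection stats: status, number of stored points, etc.
info = client.get_collection("Pentaho Data Integration")
print(info.status, info.points_count)

# Peek at a few records and their payloads
points, _next_offset = client.scroll(
    collection_name="Pentaho Data Integration",
    limit=3,
    with_payload=True,
)
for p in points:
    print(p.id, str(p.payload)[:80])
```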

Document loaders allow you to load documents from different sources like PDF, TXT, CSV, Notion, Confluence etc. They are often used together with vector stores: the loaded documents are upserted as embeddings, which can then be retrieved upon query.
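To see the retrieval half in action, here's a minimal sketch querying the collection we just upserted. It assumes LangChain-compatible payload keys (Flowise uses the LangChainJS defaults), and the question string is just an example:

```python
from qdrant_client import QdrantClient
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Qdrant

embeddings = OllamaEmbeddings(base_url="http://localhost:11434", model="nomic-embed-text")
client = QdrantClient(url="http://localhost:6333")

store = Qdrant(
    client=client,
    collection_name="Pentaho Data Integration",
    embeddings=embeddings,
)

# Embed the question and return the closest chunks from the knowledge base
for doc in store.similarity_search("How do I schedule a PDI job?", k=4):
    print(doc.page_content[:100], "...")
```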
