
PDF Parser




At some stage you will have to deal with unstructured data. This can be tricky, especially if you need to output the results in a specific format such as CSV or JSON.
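Before building the chain in Flowise, here is the idea in miniature: wrap the unstructured text in a prompt that pins down the output shape, then parse the model's reply as JSON. The model call below is a stub, and the field names and reply are illustrative, not real model output.

```python
import json

# Stubbed model call -- in the real chain this is mistral:7b via ChatOllama.
# The canned reply stands in for what a well-instructed model would return.
def fake_llm(prompt: str) -> str:
    return '{"invoice_id": "INV-001", "total": 93.5}'

def extract_invoice(raw_text: str) -> dict:
    # Wrap the unstructured text in instructions that fix the output shape.
    prompt = (
        "Extract the invoice number and total from the text below. "
        "Reply with JSON only, using the keys invoice_id and total.\n\n"
        + raw_text
    )
    return json.loads(fake_llm(prompt))  # fails loudly if the reply is not JSON
```

Everything that follows is this same pattern, assembled from Flowise nodes instead of code.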

In Open WebUI, download the mistral:7b model.
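If you want to confirm the model is available to Ollama before wiring up the chain, one option is to query Ollama's local API, which lists installed models at `/api/tags` (port 11434 by default). This is a sketch assuming a default local install:

```python
import json
from urllib.request import urlopen

def installed_models(tags: dict) -> list:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in tags.get("models", [])]

def has_model(name: str, url: str = "http://localhost:11434/api/tags") -> bool:
    # Ollama serves its local API on port 11434 by default.
    with urlopen(url) as resp:
        return name in installed_models(json.loads(resp.read()))
```

If mistral:7b is missing, `ollama pull mistral:7b` from a terminal fetches it.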

  1. Let's create a new Chain. Click on the blue '+ Add New' button in the top right.

  2. Save as: 'PDF Parser'.

  1. Add an LLM Chain. Click on the blue + sign and select Chains > LLM Chain.

  2. Drag & drop onto the canvas.

Note that the LLM Chain has an option to define the Output Parser.

  1. Next, add a Chat Model. Again, click on the blue + sign and select Chat Models > ChatOllama.

  2. Connect the ChatOllama to the LLM Chain by dragging a connector from ChatOllama to the LLM Chain.

  3. Configure the model as illustrated.

The ChatOllama node has an option to upload images. Check that the model has image reasoning capabilities: mistral:7b

Use the options at the bottom to resize canvas.

  1. Let's add a Prompt Template: Prompts > Prompt Template, and connect to LLM Chain.

A Prompt Template is similar to the System Prompt in the Chat Model.

  1. Before we progress any further, let's Save.

  2. In the Prompt Template, expand the Template. We're going to add some instructions based on the PDF.

Take a look at the main section of the sample-invoice.pdf: Invoice Number, Order Number, Invoice Date, Due Date, Tax, Total, and so on ...

We can instruct the model to extract the required information.

To include the invoice content within the prompt, we need to add a variable: {invoice}. The name can be anything that makes sense!

  1. In the Template, let's add some instructions.
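The exact wording is up to you; instructions along these lines work well (the {invoice} variable is filled in later via Format Prompt Values, and the field names are just examples based on the sample invoice):

```
Extract the following fields from the invoice below:
Invoice Number, Order Number, Service, Due Date, Total.

Invoice:
{invoice}
```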

  1. Save. Now we need to associate the PDF with the {invoice} variable. Click on Format Prompt Values.

  1. Save. To enable the file to be uploaded into the chat, go to Settings > Configuration.

  1. Enable File Upload and Save.

  1. Let's give it a go!

The sample-invoice.pdf is located in the Workshop--LLM/Data folder.

The purpose of this Chain is to be called from an external system: it parses the unstructured data source and extracts the required information as a JSON object, to be consumed further downstream.
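As a sketch of what that external caller might look like: Flowise exposes each chatflow over a prediction endpoint (`POST /api/v1/prediction/<chatflow-id>`). The chatflow id below is a placeholder you copy from the Flowise UI, and the exact upload field names may differ between Flowise versions, so check the API docs for your install:

```python
import base64
import json
from urllib.request import Request, urlopen

# Placeholder -- copy your chatflow id from the Flowise UI.
FLOWISE_URL = "http://localhost:3000/api/v1/prediction/<chatflow-id>"

def build_payload(question: str, pdf_bytes: bytes, filename: str) -> dict:
    """Wrap a question plus a base64-encoded PDF in the upload shape the
    Flowise prediction API expects (field names may vary by version)."""
    data_uri = "data:application/pdf;base64," + base64.b64encode(pdf_bytes).decode()
    return {
        "question": question,
        "uploads": [{"data": data_uri, "type": "file",
                     "name": filename, "mime": "application/pdf"}],
    }

def call_chain(question: str, pdf_path: str) -> dict:
    # Read the PDF, post it to the chain, and return the parsed response.
    with open(pdf_path, "rb") as f:
        payload = build_payload(question, f.read(), pdf_path.split("/")[-1])
    req = Request(FLOWISE_URL, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())
```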

  1. Add the Structured Output Parser: Output Parsers > Structured Output Parser.

  1. Enable Autofix and click on: 'Additional Parameters'.

The Additional Parameters setting lets you define the fields and data types in the JSON object. The model uses the Description to map the value to the Property.

  1. Double-click in the Property / Type / Description field to edit it, and select the type.

| Property     | Type   | Description    |
| ------------ | ------ | -------------- |
| invoice_id   | string | Invoice Number |
| order_id     | number | Order Number   |
| service_type | string | Service        |
| due_date     | string | Due Date       |
| total        | number | Total          |
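Given this schema, the parser should emit a JSON object shaped like the sample below (the values are made up). A downstream consumer can sanity-check the types before using it:

```python
import json

# Expected JSON types per the schema defined above:
# 'string' maps to str, 'number' to int or float.
SCHEMA = {"invoice_id": str, "order_id": (int, float), "service_type": str,
          "due_date": str, "total": (int, float)}

def validate(parsed: dict) -> bool:
    """True if every schema field is present with the right JSON type."""
    return all(isinstance(parsed.get(k), t) for k, t in SCHEMA.items())

# Illustrative output -- the values here are invented, not from a real run.
sample = json.loads(
    '{"invoice_id": "INV-2024-001", "order_id": 4711, '
    '"service_type": "Consulting", "due_date": "2024-03-01", "total": 93.5}'
)
print(validate(sample))  # True
```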

  1. And finally ..

This workflow could be used in a number of use cases, from document classification to structured data extraction.

Workflow recap: Add Chains > LLM Chain → Add Ollama Model → Add Prompt Template → Add instructions → Associate PDF with {invoice} → Configuration → File Upload → Execute Chain → Add Structured Output Parser → Set Property / Data Type → JSON output.