Advanced ingestion process powered by LLM parsing for RAG system

Arnau Perez, Xavier Vizcaino

TL;DR

The paper tackles the challenge of processing multimodal, structurally diverse documents under limited context windows by proposing a multi-strategy ingestion pipeline that combines FAST, LLM-powered OCR, and semantic parsing with a node-based hierarchical representation. It introduces a Multimodal Assembler Agent and a flexible embedding strategy, with careful selection of vector databases and per-node embedding rules to optimize retrieval fidelity. Evaluation across heterogeneous knowledge bases shows improvements in Answer Relevancy and Faithfulness, while revealing trade-offs in contextual retrieval and highlighting the potential gains from reranking and improved concept linking. The approach advances RAG workflows for complex documents like slides and scanned PDFs, enabling more accurate and contextually aware retrieval in practical applications.

Abstract

Retrieval-Augmented Generation (RAG) systems struggle to process multimodal documents of varying structural complexity. This paper introduces a novel multi-strategy parsing approach that uses LLM-powered OCR to extract content from diverse document types, including presentations and text-dense files, whether scanned or not. The methodology employs a node-based extraction technique that creates relationships between different information types and generates context-aware metadata. By implementing a Multimodal Assembler Agent and a flexible embedding strategy, the system enhances document comprehension and retrieval capabilities. Experimental evaluations across multiple knowledge bases demonstrate the approach's effectiveness, showing improvements in answer relevancy and information faithfulness.

Paper Structure

This paper contains 21 sections, 6 equations, 11 figures.

Figures (11)

  • Figure 1: Preprocessing pipeline for document ingestion in a RAG system. The flowchart illustrates the parsing and assembling process for PDF, DOCX, and PPTX files. The pipeline incorporates FAST, OCR, and LLM parsing techniques, followed by image description, text extraction, and snapshot creation. The assembler combines these elements to produce one markdown file per page, plus a concatenation of all pages.
  • Figure 2: Assembly of page 2 of the paper AIDAPaper. Five kinds of nodes are represented: Header, Text, Table, Image, and Page. The image node appears in the markdown file with src set to the ID of the extracted image plus its file extension, and alt set to the description of the image.
  • Figure 3: Comparative Analysis of Knowledge Base Performance Across Different Metrics. The figure illustrates the percentage results of various metrics employed in the study. From left to right, the graph presents data for three distinct knowledge bases: a collection of academic articles covering diverse topics, corporate documentation from Applus+ IDIADA, and a heterogeneous knowledge base comprising mixed topics and file structures, including both presentation-style and high text-density documents.
  • ...and 8 more figures
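
The assembling step described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, its field names, and the functions `render_node`, `assemble_page`, and `assemble_document` are all hypothetical, introduced only for illustration. The sketch assumes that header, text, table, and image nodes are rendered to markdown in order, that an image node becomes a markdown image whose src is the image ID plus its extension and whose alt is the image description, and that a page node is simply the list of nodes on that page.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node type; field names are assumptions, not from the paper."""
    kind: str            # "header", "text", "table", or "image"
    content: str = ""    # text, table markdown, or the image description
    image_id: str = ""   # only used by image nodes
    extension: str = ""  # only used by image nodes, e.g. ".png"
    level: int = 1       # only used by header nodes


def render_node(node: Node) -> str:
    """Render one node as a markdown fragment."""
    if node.kind == "header":
        return "#" * node.level + " " + node.content
    if node.kind == "image":
        # src = image ID plus extension, alt = image description
        return f"![{node.content}]({node.image_id}{node.extension})"
    # Text and table nodes are assumed to already hold markdown.
    return node.content


def assemble_page(nodes: list[Node]) -> str:
    """Join the rendered nodes of one page into a single markdown string."""
    return "\n\n".join(render_node(n) for n in nodes)


def assemble_document(pages: list[list[Node]]) -> tuple[list[str], str]:
    """Return one markdown string per page and the concatenation of all pages."""
    page_markdown = [assemble_page(p) for p in pages]
    return page_markdown, "\n\n".join(page_markdown)


if __name__ == "__main__":
    page = [
        Node("header", "Results", level=2),
        Node("image", "Bar chart of metrics", image_id="img_001", extension=".png"),
    ]
    per_page, full = assemble_document([page, [Node("text", "Hello")]])
    print(full)
```

Keeping the per-page strings alongside the concatenation mirrors the two outputs named in the Figure 1 caption; how the paper's Multimodal Assembler Agent actually orders or links nodes is not specified here.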