Advanced ingestion process powered by LLM parsing for RAG system
Arnau Perez, Xavier Vizcaino
TL;DR
The paper tackles the challenge of processing multimodal, structurally diverse documents under limited context windows by proposing a multi-strategy ingestion pipeline that combines FAST, LLM-powered OCR, and semantic parsing with a node-based hierarchical representation. It introduces a Multimodal Assembler Agent and a flexible embedding strategy, with careful selection of vector databases and per-node embedding rules to optimize retrieval fidelity. Evaluation across heterogeneous knowledge bases shows improvements in Answer Relevancy and Faithfulness, while revealing trade-offs in contextual retrieval and highlighting the potential gains from reranking and improved concept linking. The approach advances RAG workflows for complex documents like slides and scanned PDFs, enabling more accurate and contextually aware retrieval in practical applications.
Abstract
Retrieval-Augmented Generation (RAG) systems struggle to process multimodal documents of varying structural complexity. This paper introduces a multi-strategy parsing approach that uses LLM-powered OCR to extract content from diverse document types, including presentations and text-dense files, whether scanned or not. The methodology employs a node-based extraction technique that creates relationships between different information types and generates context-aware metadata. By implementing a Multimodal Assembler Agent and a flexible embedding strategy, the system enhances document comprehension and retrieval capabilities. Experimental evaluations across multiple knowledge bases demonstrate the approach's effectiveness, showing improvements in answer relevancy and faithfulness.
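To make the node-based representation and per-node embedding rules more concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify a schema, so the `Node` fields, the `EMBED_KINDS` rule set, and the helper names `build_context` and `texts_to_embed` are all hypothetical. The sketch assumes each extracted element becomes a node linked to its parent, and that the text sent to the embedding model is prefixed with the titles of its ancestor sections as context-aware metadata.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One extracted element of a document (hypothetical schema)."""
    node_id: str
    kind: str                      # e.g. "section", "text", "table", "image"
    content: str                   # raw text, or a description for non-text items
    parent_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)


# Assumed per-node embedding rule: only leaf content is embedded,
# while section nodes serve purely as structure and context.
EMBED_KINDS = {"text", "table", "image"}


def build_context(node: Node, index: dict) -> str:
    """Collect ancestor contents from the root down, joined with ' > '."""
    path = []
    current = index.get(node.parent_id)
    while current is not None:
        path.append(current.content)
        current = index.get(current.parent_id)
    return " > ".join(reversed(path))


def texts_to_embed(nodes: list) -> list:
    """Return (node_id, text) pairs for the nodes that the rules select."""
    index = {n.node_id: n for n in nodes}
    selected = []
    for n in nodes:
        if n.kind not in EMBED_KINDS:
            continue
        context = build_context(n, index)
        n.metadata["context"] = context  # stored as context-aware metadata
        text = f"{context}\n{n.content}" if context else n.content
        selected.append((n.node_id, text))
    return selected
```

Under these assumptions, a text node nested under the sections "Q3 Results" and "Revenue" would be embedded with the prefix `Q3 Results > Revenue`, so that a retrieved chunk carries its position in the document hierarchy even when the chunk itself is short.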
