AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation

Zhengren Wang, Dongsheng Ma, Huaping Zhong, Jiayu Li, Wentao Zhang, Bin Wang, Conghui He

TL;DR

AgenticOCR is a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full-text process into a query-driven, on-demand extraction system, achieving expert-level performance in long document understanding.

Abstract

The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge of processing complex visual documents, such as financial reports. While page-level chunking and retrieval are a natural starting point, they create a critical bottleneck: delivering entire pages to the generator introduces excessive extraneous context. This not only overloads the generator's attention mechanism but also dilutes the most salient evidence. Moreover, compressing these information-rich pages into a limited visual token budget further increases the risk of hallucinations. To address this, we introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full-text process into a query-driven, on-demand extraction system. By autonomously analyzing document layout in a "thinking with images" manner, AgenticOCR identifies and selectively recognizes regions of interest. This approach performs on-demand decompression of visual tokens precisely where needed, effectively decoupling retrieval granularity from rigid page-level chunking. AgenticOCR has the potential to serve as the "third building block" of the visual document RAG stack, operating alongside and enhancing standard Embedding and Reranking modules. Experimental results demonstrate that AgenticOCR improves both the efficiency and accuracy of visual RAG systems, achieving expert-level performance in long document understanding. Code and models are available at https://github.com/OpenDataLab/AgenticOCR.
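
The idea of decompressing visual tokens only where needed can be illustrated with a toy example. The sketch below is not the AgenticOCR implementation: the `Region` type, the `zoom` and `rotate_clockwise` helpers, and the cost model of one visual token per 28x28 pixel patch are all assumptions made for illustration. It only shows why passing a cropped region of interest to the generator can cost far fewer visual tokens than passing the whole page.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Region:
    """Hypothetical pixel-space box; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def zoom(page, region):
    """Crop a region out of a page stored as a list of pixel rows."""
    return [row[region.left:region.right] for row in page[region.top:region.bottom]]


def rotate_clockwise(image):
    """Rotate an image (list of pixel rows) by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def visual_tokens(width, height, patch=28):
    """Assumed cost model: one visual token per patch x patch tile."""
    return ceil(width / patch) * ceil(height / patch)


# Toy grayscale "page" of 280 x 392 pixels (values are arbitrary).
PAGE_WIDTH, PAGE_HEIGHT = 280, 392
page = [[(x + y) % 256 for x in range(PAGE_WIDTH)] for y in range(PAGE_HEIGHT)]

# Hypothetical region of interest, e.g. a table the query asks about.
table = Region(left=56, top=140, right=196, bottom=224)
crop = zoom(page, table)

full_cost = visual_tokens(PAGE_WIDTH, PAGE_HEIGHT)
crop_cost = visual_tokens(table.width, table.height)
print(f"full page: {full_cost} tokens, cropped region: {crop_cost} tokens")
```

In a real pipeline the region would be chosen by the model after inspecting the layout, and the crop would be passed to an OCR step; here the box is hard-coded to keep the example self-contained.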

Paper Structure (35 sections, 4 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Overview of AgenticOCR-based RAG. AgenticOCR performs on-demand decompression of visual information precisely where it is needed by utilizing operations such as zoom and rotate.
  • Figure 2: The training and inference of AgenticOCR. Built upon the image_zoom_and_ocr_tool, trajectory distillation is first performed to initialize the SFT policy. The model is subsequently optimized through GRPO and finally deployed through an integration protocol within visual RAG pipelines.
  • Figure 3: Distribution of positive trajectories in SFT dataset.
  • Figure 4: RL curves for reward and its standard deviation. The trend demonstrates that the agent effectively learns to use tools strategically to solve tasks.
  • Figure 5: An example of the AgenticOCR Model retrieval and evidence extraction workflow. The figure illustrates the input document, the key region identified by the model, the visualization of structured HTML table output, and the final evidence crop image, demonstrating the system's ability to perform on-demand decompression of visual information through zoom and OCR operations.
  • ...and 3 more figures
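
The captions of Figures 2 and 4 refer to GRPO optimization and to the standard deviation of the reward. As a minimal sketch of the group-relative advantage that GRPO-style training commonly uses, the snippet below normalizes each trajectory's reward by the mean and standard deviation of its sampling group. The reward values, the `eps` constant, and the choice of population standard deviation are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled trajectories.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and the eps value are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four trajectories sampled for the same query.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every trajectory in a group earns the same reward, all advantages are zero,
# so that group contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(advantages, flat)
```

This is one reason the within-group reward spread is worth monitoring during training: under this formulation, a group with zero spread yields zero advantages.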