ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization

Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath

TL;DR

ARIAL introduces an agentic, modular framework for Document VQA that pairs an LLM-driven planner with specialized OCR, retrieval, QA, and grounding tools to achieve both precise answer localization and strong textual accuracy. By decomposing tasks and providing explicit bounding-box grounding through a retrieval-augmented reasoning loop, ARIAL delivers state-of-the-art results across DocVQA, FUNSD, CORD, and SROIE while offering interpretable tool traces for auditability. Key contributions include a learnable planning agent (LLaMA 4 Scout), retrieval-augmented QA (Gemma 3-27B), and explicit spatial grounding, all validated through extensive ablations. The work demonstrates that modular, explainable AI pipelines can surpass monolithic models in both performance and trustworthiness for high-stakes document understanding tasks.

Abstract

Document Visual Question Answering (VQA) requires models to not only extract accurate textual answers but also precisely localize them within document images, a capability critical for interpretability in high-stakes applications. However, existing systems achieve strong textual accuracy while producing unreliable spatial grounding, or sacrifice performance for interpretability. We present ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular framework that orchestrates specialized tools through an LLM-based planning agent to achieve both precise answer extraction and reliable spatial grounding. ARIAL decomposes Document VQA into structured subtasks: OCR-based text extraction with TrOCR, retrieval-augmented context selection using semantic search, answer generation via a fine-tuned Gemma 3-27B model, and explicit bounding-box localization through text-to-region alignment. This modular architecture produces transparent reasoning traces, enabling tool-level auditability and independent component optimization. We evaluate ARIAL on four benchmarks (DocVQA, FUNSD, CORD, and SROIE) using both textual accuracy (ANLS) and spatial precision (mAP at IoU 0.50 to 0.95). ARIAL achieves state-of-the-art results across all datasets: 88.7 ANLS and 50.1 mAP on DocVQA, 90.0 ANLS and 50.3 mAP on FUNSD, 85.5 ANLS and 60.2 mAP on CORD, and 93.1 ANLS on SROIE, surpassing the previous best method (DLaVA) by +2.8 ANLS and +3.9 mAP on DocVQA. Our work demonstrates how agentic orchestration of specialized tools can simultaneously improve performance and interpretability, providing a pathway toward trustworthy, explainable document AI systems.
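
The two metrics named in the abstract can be illustrated with a small sketch. It uses the ANLS definition commonly applied to DocVQA (normalized Levenshtein similarity, zeroed at a 0.5 threshold, best score over the reference answers) and a plain box IoU. The function names, the corner box format `(x1, y1, x2, y2)`, and the 0.05 step between IoU thresholds are assumptions for illustration; the abstract only states the range 0.50 to 0.95, and this is not the authors' evaluation code.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); corner format is an assumption


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity on a 0-1 scale."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - nl if nl < tau else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style sweep: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS: List[float] = [round(0.50 + 0.05 * k, 2) for k in range(10)]


def thresholds_met(pred: Box, gt: Box) -> List[float]:
    """IoU thresholds at which a predicted box counts as a correct localization."""
    overlap = iou(pred, gt)
    return [t for t in IOU_THRESHOLDS if overlap >= t]
```

A prediction of "12.50" against the reference "12.5" is one edit away over five characters, so it contributes 0.8 to the average; a box that overlaps the ground truth with IoU 0.82 counts as correct at the seven thresholds from 0.50 through 0.80 and as a miss at the remaining three. Reported scores such as 88.7 ANLS correspond to this 0-1 value multiplied by 100.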

Paper Structure

This paper contains 22 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the ARIAL agentic workflow for Document VQA. The system consists of three modular stages: (1) Input Processing, where an OCR module extracts text segments and bounding boxes from a document image; (2) Agentic Reasoning Pipeline, where the planner agent coordinates task execution—retrieving relevant text, invoking QA or computation, and triggering spatial grounding; and (3) Output Generation, where the final answer and its bounding box are produced. The reasoning loop enables iterative refinement based on confidence, supporting flexible and context-aware decision-making.
  • Figure 2: Illustrative examples of visual information extraction on receipt images from the CORD dataset (Park et al., 2019). Each colored annotation corresponds to its extracted answer, highlighted by a matching colored bounding box.
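
The three-stage workflow described in the Figure 1 caption can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the ARIAL implementation: a fixed loop stands in for the LLM planner, word overlap stands in for semantic search, the QA step is a caller-supplied function, and the grounding match score is used as a proxy for confidence. All names (`Segment`, `retrieve`, `ground`, `run_pipeline`, `toy_qa`) and the sample receipt data are hypothetical.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); assumed format


@dataclass
class Segment:
    """One OCR output unit: recognized text plus its bounding box."""
    text: str
    box: Box


def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, segments: List[Segment], k: int) -> List[Segment]:
    """Toy retrieval: rank segments by word overlap with the question."""
    q = tokens(question)
    ranked = sorted(segments, key=lambda s: len(q & tokens(s.text)), reverse=True)
    return ranked[:k]


def ground(answer: str, segments: List[Segment]) -> Tuple[Optional[Box], float]:
    """Text-to-region alignment: pick the segment that best contains the answer.

    The score is the fraction of the answer covered by the longest common
    substring with the segment text (1.0 means the answer appears verbatim).
    """
    a = answer.strip().lower()
    best_box, best_score = None, 0.0
    if not a:
        return best_box, best_score
    for seg in segments:
        b = seg.text.lower()
        sm = SequenceMatcher(None, a, b, autojunk=False)
        match = sm.find_longest_match(0, len(a), 0, len(b))
        score = match.size / len(a)
        if score > best_score:
            best_box, best_score = seg.box, score
    return best_box, best_score


QAFn = Callable[[str, List[Segment]], str]


def run_pipeline(question: str, segments: List[Segment], qa: QAFn,
                 min_conf: float = 0.6, max_rounds: int = 3):
    """Retrieve, answer, ground; widen the context while confidence is low."""
    trace: List[str] = []  # human-readable record of every tool call
    answer, box = "", None
    k = 1
    for _ in range(max_rounds):
        context = retrieve(question, segments, k)
        trace.append(f"retrieve(k={k}) -> {[s.text for s in context]}")
        answer = qa(question, context)
        trace.append(f"qa -> {answer!r}")
        box, conf = ground(answer, context)
        trace.append(f"ground -> box={box}, conf={conf:.2f}")
        if conf >= min_conf:
            break
        k += 1  # refinement step: retry with more retrieved context
    return answer, box, trace


# Hypothetical OCR output for a receipt and a trivial stand-in QA function.
segs = [
    Segment("Invoice No: 4821", (40, 30, 220, 50)),
    Segment("Total: 19.90", (40, 300, 180, 320)),
    Segment("Thank you for shopping", (40, 340, 260, 360)),
]


def toy_qa(question: str, context: List[Segment]) -> str:
    """Return the value after the colon in the top-ranked segment."""
    return context[0].text.split(":")[-1].strip() if context else ""


answer, box, trace = run_pipeline("What is the total?", segs, toy_qa)
```

The returned `trace` list plays the role of the interpretable tool trace: each entry records which tool ran and what it produced, so a reviewer can see why a given box was attached to a given answer. Because each stage is a separate function, any one of them can be swapped for a stronger component without touching the others, which is the property the modular design is meant to provide.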