ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering

Aymen Lassoued, Mohamed Ali Souibgui, Yousri Kessentini

TL;DR

The proposed ORCA framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components, establishing a new paradigm for collaborative agent systems in vision-language reasoning.

Abstract

Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress-testing, and when necessary, a thesis-antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning.

Paper Structure

This paper contains 39 sections, 14 equations, 5 figures, 9 tables, 3 algorithms.

Figures (5)

  • Figure 1: Comparison of different approaches for DocVQA. Single-model VLMs and reasoning-enhanced VLMs lack critical capabilities such as adaptivity and self-verification. In contrast, ORCA introduces a feature-oriented, multi-agent design achieving improved DocVQA performance as well as the missing capabilities in one unified framework.
  • Figure 2: Overview of ORCA: A reasoning-guided multi-agent framework for Document Visual Question Answering operating through five stages: (1) Context Understanding: A thinker agent analyzes the question and document to generate both a reasoning path and initial answer ($a_T$). (2) Collaborative Agent Execution: A router selects relevant specialized agents from a dock of nine expert types (OCR, Layout, Table/List, Figure/Diagram, Form, Free Text, Image/Photo, Yes/No, and General), which an orchestrator sequences for optimal execution to produce an expert answer ($a_E$). (3) Stress Testing: When $a_E$ differs from $a_T$, a debate agent generates challenging questions to stress-test the specialized agent's confidence, with an evaluation agent assessing the responses to produce $a_D$. (4) Multi-turn Conversation: If stress testing indicates uncertainty, thesis and antithesis agents engage in structured three-turn debate under judge supervision to resolve conflicts and generate $a_C$. (5) Answer Refinement: A sanity checker performs final formatting corrections to ensure consistency with document conventions, producing the final answer ($a_F$).
  • Figure 3: A case study demonstrating ORCA's multi-agent reasoning pipeline on a complex visual document question. Better viewed with zoom. Question: What publication detail accompanies the Genealogical Society entry? Ground-truth answer: "GSU, 1977".
  • Figure 4: ORCA demonstrates robust multi-stage reasoning on a document containing ambiguous textual references and visually challenging OCR content. While baseline VLMs fail due to misidentification and shallow pattern matching, ORCA decomposes the task into OCR parsing, cell-level localization, cross-reference verification, and answer consistency checking. Through iterative agent collaboration and critical evidence consolidation, ORCA resolves ambiguity, corrects earlier misinterpretations, and converges on the correct entity with high confidence.
  • Figure 5: ORCA successfully handles a structurally complex form where precise line indexing, noisy OCR text, and subtle vocabulary variations mislead baseline VLMs. By combining layout-aware processing, content-aware sequence reasoning, and downstream sanity validation, ORCA incrementally narrows the search space and suppresses earlier incorrect hypotheses. The multi-agent pipeline enables reliable disambiguation and robust extraction even under OCR artifacts and positional uncertainty.
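
The five-stage control flow described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Agents` container, the function `run_orca_sketch`, and every callable signature are hypothetical stand-ins for model-backed agents. Two further assumptions are made where the caption is silent: answers are compared after trimming whitespace and lowercasing, and the thesis and antithesis sides of the debate are taken to defend $a_T$ and $a_D$ respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# The nine expert types listed in the Figure 2 caption.
AGENT_DOCK = ("OCR", "Layout", "Table/List", "Figure/Diagram", "Form",
              "Free Text", "Image/Photo", "Yes/No", "General")


@dataclass
class Agents:
    """Hypothetical callables standing in for the model-backed agents."""
    thinker: Callable[[str, str], Tuple[str, str]]       # -> (reasoning path, a_T)
    router: Callable[[str, str], List[str]]              # -> selected expert names
    orchestrator: Callable[[List[str]], List[str]]       # -> execution order
    experts: Callable[[List[str], str, str], str]        # -> a_E
    stress_test: Callable[[str, str, str], Tuple[str, bool]]  # -> (a_D, uncertain?)
    debate: Callable[[str, str, str, int], str]          # -> a_C
    sanity_check: Callable[[str, str], str]              # -> a_F


def run_orca_sketch(agents: Agents, question: str, document: str) -> Tuple[str, List[str]]:
    """Return the final answer a_F and the list of stages that actually ran."""
    stages: List[str] = []

    # (1) Context Understanding: reasoning path plus initial answer a_T.
    reasoning_path, a_t = agents.thinker(question, document)
    stages.append("context_understanding")

    # (2) Collaborative Agent Execution: route, sequence, run experts -> a_E.
    selected = agents.router(reasoning_path, document)
    unknown = [name for name in selected if name not in AGENT_DOCK]
    if unknown:
        raise ValueError(f"router selected agents outside the dock: {unknown}")
    a_e = agents.experts(agents.orchestrator(selected), question, document)
    stages.append("collaborative_execution")

    answer = a_e
    # (3) Stress Testing runs only when a_E differs from a_T.
    # Assumption: "differs" means unequal after trimming and lowercasing.
    if a_e.strip().lower() != a_t.strip().lower():
        a_d, uncertain = agents.stress_test(a_e, question, document)
        stages.append("stress_testing")
        answer = a_d
        # (4) Multi-turn Conversation: three-turn debate only under uncertainty.
        if uncertain:
            answer = agents.debate(a_t, a_d, document, 3)
            stages.append("multi_turn_conversation")

    # (5) Answer Refinement: final formatting pass -> a_F.
    a_f = agents.sanity_check(answer, document)
    stages.append("answer_refinement")
    return a_f, stages
```

The returned stage list makes the conditional structure visible: when the thinker and the experts agree, stages 3 and 4 are skipped and only the sanity checker touches the expert answer.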