Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling

Xinlei Yu, Chengming Xu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Cheng Yang, Jiangning Zhang, Shuicheng Yan, Xiaobin Hu

TL;DR

This paper tackles the diminishing returns of monolithic scaling for visual document understanding by introducing MACT, a four-agent framework (planning, execution, judgment, answer) that enables procedural scaling and a self-correction loop for factual grounding. It couples agent-wise test-time scaling with a mixed reward modeling scheme to align incentives across agents, and trains via a two-stage pipeline using VLMs for planning/execution and LLMs for judgment/answer, with RL rewards from VisualPRM and Skywork-VL-Reward. Evaluated on 15 benchmarks spanning document and non-document tasks, MACT variants consistently outperform baselines and rival larger monolithic models, especially on long-context and mathematical reasoning, demonstrating the value of procedural scaling. Ablation studies confirm the importance of multi-agent collaboration, adaptive scaling, and reward design, supporting MACT as a practical, scalable paradigm for document-based reasoning.

Abstract

The dominant paradigm of monolithic scaling in Vision-Language Models (VLMs) is failing for understanding and reasoning in documents, yielding diminishing returns as it struggles with the inherent need of this domain for document-based procedural reasoning, cognitive complexity, and factual accuracy. To this end, we introduce MACT, a Multi-Agent Collaboration framework with agent-wise adaptive Test-time scaling that pioneers a paradigm shift to procedural scaling, adapting dynamically to the functional entities of visual document understanding and reasoning. MACT decomposes the visual document processing flow into four specialized agents, i.e., planning, execution, judgment, and answer, to resolve cognitive overload and introduce a critical self-correction loop for factual grounding. This collaborative architecture is amplified by an agent-wise adaptive test-time scaling strategy that intelligently allocates computational resources based on the complexity and redundancy of each functionality. Evaluated on multiple visual document understanding benchmarks, MACT achieves superior performance with a smaller parameter scale, adapting effectively to various document scenarios without compromising its general or mathematical reasoning capabilities. The three variants of MACT consistently attain top-three average performance rankings, with average performance enhancements of 9.9-11.5% over the base models. The source code will be released publicly.
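The four-agent flow with its self-correction loop can be illustrated with a minimal Python sketch. This is not the authors' implementation: the agent callables, the `Verdict` structure, the redirect labels, and the feedback-passing convention are all hypothetical. The only details taken from the paper are the four roles, a judgment agent that redirects to earlier agents rather than correcting by itself, and a cap on the number of corrections (set to 3 here, matching the setting mentioned for Figure 4).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    """Hypothetical output of the judgment agent."""
    ok: bool
    redirect_to: Optional[str] = None  # assumed labels: "planning" or "execution"
    feedback: str = ""


def run_pipeline(
    question: str,
    plan_fn: Callable[[str, str], str],
    execute_fn: Callable[[str, str, str], str],
    judge_fn: Callable[[str, str, str], Verdict],
    answer_fn: Callable[[str, str, str], str],
    max_corrections: int = 3,
) -> Tuple[str, int]:
    """Plan, execute, judge, and answer, with a bounded correction loop."""
    feedback = ""
    plan = plan_fn(question, feedback)
    result = execute_fn(question, plan, feedback)
    corrections = 0
    while True:
        # The judgment agent only detects mistakes; it never edits the output.
        verdict = judge_fn(question, plan, result)
        if verdict.ok or corrections >= max_corrections:
            break
        corrections += 1
        feedback = verdict.feedback
        # Redirect to the agent the judge blames; a new plan forces re-execution.
        if verdict.redirect_to == "planning":
            plan = plan_fn(question, feedback)
        result = execute_fn(question, plan, feedback)
    return answer_fn(question, plan, result), corrections
```

In practice each callable would wrap a model call (VLMs for planning and execution, LLMs for judgment and answer, per the TL;DR); plain functions are used here so the control flow can be read and tested in isolation.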

Paper Structure

This paper contains 13 sections, 6 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparisons among the three variants of MACT, the base models of these variants, and larger-scale models within the same model family, indicating the superiority of our framework over monolithic scaling paradigms.
  • Figure 2: The overview of MACT. The upper part demonstrates our procedural framework, with four tailored and collaborative agents to conduct the process of document analysis. When the judgment agent detects mistakes, it redirects to previous agents for corrections. The lower part illustrates the agent-wise adaptive test-time scaling and mixed reward modeling for the multi-agent framework.
  • Figure 3: Comparisons of (a) internal correction, (b) an extra agent for both judgment and correction, and (c) our strategy utilizing an independent judgment agent.
  • Figure 4: (a) The line graph shows the impact of the maximum number of corrections, with solid and dashed lines denoting average values across all benchmarks and across three selected benchmarks, respectively. The bar charts show the average number of judgments when the maximum is set to 3. (b) The line graphs show the impact of the number of generated plans $N_p$ and the number of candidate executions $N_e$.
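
The roles of $N_p$ and $N_e$ in Figure 4(b) can be sketched as a best-of-N selection, under stated assumptions. The paper summary says only that several plans and several candidate executions are generated and that reward models are involved; picking the single highest-scoring candidate at each stage is a common test-time scaling technique substituted here for illustration, and may differ from the authors' exact procedure. All function names and signatures below are hypothetical.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[int], T], score: Callable[[T], float], n: int) -> T:
    """Sample n candidates and keep the one with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)


def scaled_plan_and_execute(
    question: str,
    gen_plan: Callable[[str, int], str],
    score_plan: Callable[[str], float],
    gen_exec: Callable[[str, str, int], Tuple[str, int]],
    score_exec: Callable[[Tuple[str, int]], float],
    n_p: int,
    n_e: int,
) -> Tuple[str, Tuple[str, int]]:
    """Agent-wise scaling: separate sampling budgets for planning and execution."""
    # Assumption: a reward function ranks plans, and the best plan is kept.
    plan = best_of_n(lambda i: gen_plan(question, i), score_plan, n_p)
    # Assumption: candidate executions are sampled only for the chosen plan.
    execution = best_of_n(lambda i: gen_exec(question, plan, i), score_exec, n_e)
    return plan, execution
```

Giving each agent its own budget (`n_p`, `n_e`) is what makes the scaling agent-wise: a stage that benefits more from extra samples can receive more compute without inflating the others.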