Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling
Xinlei Yu, Chengming Xu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Cheng Yang, Jiangning Zhang, Shuicheng Yan, Xiaobin Hu
TL;DR
This paper tackles the diminishing returns of monolithic scaling for visual document understanding by introducing MACT, a four-agent framework (planning, execution, judgment, answer) that enables procedural scaling and a self-correction loop for factual grounding. It couples agent-wise test-time scaling with a mixed reward modeling scheme to align incentives across agents, and trains via a two-stage pipeline that uses VLMs for planning and execution and LLMs for judgment and answer, with RL rewards from VisualPRM and Skywork-VL-Reward. Evaluated on 15 benchmarks spanning document and non-document tasks, MACT variants consistently outperform baselines and rival larger monolithic models, especially on long-context and mathematical reasoning, demonstrating the value of procedural scaling. Ablation studies confirm the importance of multi-agent collaboration, adaptive scaling, and reward design, supporting MACT as a practical, scalable paradigm for document-based reasoning.
Abstract
The dominant paradigm of monolithic scaling in Vision-Language Models (VLMs) is failing for understanding and reasoning in documents, yielding diminishing returns as it struggles with this domain's inherent need for document-based procedural reasoning, cognitive complexity, and factual accuracy. To this end, we introduce MACT, a Multi-Agent Collaboration framework with agent-wise adaptive Test-time scaling that pioneers a paradigm shift to procedural scaling, adapting dynamically to the functional entities of visual document understanding and reasoning. MACT decomposes the visual document processing flow into four specialized agents, i.e., planning, execution, judgment, and answer, to resolve cognitive overload and introduce a critical self-correction loop for factual grounding. This collaborative architecture is amplified by an agent-wise adaptive test-time scaling strategy that intelligently allocates computational resources based on the complexity and redundancy of each functionality. Evaluated on multiple visual document understanding benchmarks, MACT achieves superior performance with a smaller parameter scale, adapting effectively to various document scenarios without compromising its general or mathematical reasoning capabilities. The three variants of MACT consistently attain top-three average performance rankings, with average performance enhancements of 9.9-11.5% over the base models. The source code will be released publicly.
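
As a rough illustration of the control flow the abstract describes, the sketch below wires four agents into a plan, execute, judge, answer loop with a separate sampling budget per agent. It is not the authors' implementation: the names `Budget`, `best_of_n`, and `run_pipeline`, the budget values, the use of best-of-N sampling as the scaling mechanism, and the choice to route the judge's feedback back to the planning step are all assumptions made for illustration. The agents are passed in as plain callables, so no particular model API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Budget:
    """Hypothetical per-agent test-time budget (candidates sampled per call)."""
    planning: int = 4
    execution: int = 2


def best_of_n(generate: Callable[[int], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Sample n candidates and keep the highest-scoring one (first wins ties)."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)


def run_pipeline(question: str,
                 plan: Callable[[str, str, int], str],
                 execute: Callable[[str, int], str],
                 judge: Callable[[str, str], Tuple[bool, str]],
                 answer: Callable[[str, str], str],
                 score: Callable[[str], float],
                 budget: Budget,
                 max_rounds: int = 3) -> Tuple[str, int]:
    """Run plan -> execute -> judge, retrying on rejection, then answer.

    Returns the final answer and the number of rounds actually used.
    """
    feedback = ""   # judge feedback carried into the next planning round
    trace = ""      # most recent execution output
    rounds_used = 0
    for round_idx in range(max_rounds):
        rounds_used = round_idx + 1
        # Each agent gets its own sampling budget (assumed scaling mechanism).
        chosen_plan = best_of_n(
            lambda i: plan(question, feedback, i), score, budget.planning)
        trace = best_of_n(
            lambda i: execute(chosen_plan, i), score, budget.execution)
        # The judge only accepts or rejects; it does not rewrite the trace.
        accepted, feedback = judge(chosen_plan, trace)
        if accepted:
            break
    return answer(question, trace), rounds_used
```

In this sketch the per-agent budgets are fixed constants; an adaptive variant would choose them per input, which is left out here because the abstract does not specify how the allocation is made.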
