Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta

Abstract

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.

Paper Structure (143 sections, 17 equations, 41 figures, 16 tables, 1 algorithm)

Figures (41)

  • Figure 1: Given a query $q$ over corpus $\mathcal{C}$, the system iteratively retrieves pages, reasons over visual and textual content, and aggregates evidence from multiple pages $\mathcal{E} = \{p_{i,j}, \ldots\}$ to produce a grounded answer $a$ with attribution. The process typically requires decomposing $q$, iterative retrieval, and synthesizing across $\mathcal{E}$.
  • Figure 2: Layout element density across document domains in MADQA. The heatmap shows the standardized (z-scored) concentration of individual layout elements within each domain. Pink indicates above-average density, while cyan indicates below-average density. A detailed discussion is provided in Appendix \ref{app:layout_elemt_density}.
  • Figure 3: Sample question (X-Doc). No single document covers the full period. The agent must retrieve both the 2018 report (covering 2014--2018) and the 2019 report, then extract and sum the relevant values. More examples are provided in Appendix \ref{app:examples}.
  • Figure 4: Visual necessity in MADQA. 58% of the questions benefit from understanding Structured layouts, Tabular data, or Visual Artifacts (e.g., charts, stamps). The matrix highlights that multi-category dependencies (e.g., Structured + Artifacts) are a significant driver of benchmark difficulty.
  • Figure 5: Principled dev/test set selection. We evaluate every question based on Difficulty (mean accuracy) and Discrimination (point-biserial correlation). The Sentinel Pool ($\bullet$) captures the hardest items to preserve headroom, regardless of discrimination scores. For the remaining budget, we stratify questions into difficulty bins and greedily select those with the highest discrimination signal ($\bullet$), discarding questions with lower predictive power ($\bullet$). Data on the plot are illustrative.
  • ...and 36 more figures
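
The selection procedure summarized in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary response matrix (one row per evaluated system, one column per question), and the function names, the rest-score correction, the equal-width difficulty bins, and the round-robin order across bins are all assumptions made for the example. The point-biserial correlation is computed as a Pearson correlation between a binary item column and a continuous score.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def item_stats(responses):
    """responses[m][i] is 1 if system m answered item i correctly, else 0.

    Returns one (item_index, difficulty, discrimination) tuple per item.
    Difficulty is mean accuracy; discrimination is the point-biserial
    correlation between the item and the score on all other items
    (the rest-score variant is an assumption of this sketch).
    """
    totals = [sum(row) for row in responses]
    stats = []
    for i in range(len(responses[0])):
        col = [row[i] for row in responses]
        rest = [t - c for t, c in zip(totals, col)]
        stats.append((i, mean(col), pearson(col, rest)))
    return stats


def select_items(stats, budget, sentinel_size, n_bins=4):
    """Pick a sentinel pool of the hardest items, then fill the rest of the
    budget with the most discriminative items from each difficulty bin."""
    by_difficulty = sorted(stats, key=lambda s: s[1])
    # Sentinel pool: lowest mean accuracy, regardless of discrimination.
    sentinel = [s[0] for s in by_difficulty[:sentinel_size]]

    # Stratify the remaining items into equal-width difficulty bins.
    bins = [[] for _ in range(n_bins)]
    for s in by_difficulty[sentinel_size:]:
        bins[min(int(s[1] * n_bins), n_bins - 1)].append(s)
    for b in bins:
        b.sort(key=lambda s: s[2], reverse=True)

    # Greedy round-robin: take the next-best item from each bin in turn.
    chosen, quota, rank = [], budget - len(sentinel), 0
    while len(chosen) < quota and any(rank < len(b) for b in bins):
        for b in bins:
            if rank < len(b) and len(chosen) < quota:
                chosen.append(b[rank][0])
        rank += 1
    return sentinel, chosen


# Toy data: 4 hypothetical systems (rows) on 6 hypothetical questions (columns).
responses = [
    [0, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 1, 1],
]
sentinel, chosen = select_items(item_stats(responses), budget=4, sentinel_size=1)
print("sentinel:", sentinel, "selected:", chosen)
```

Items outside both returned lists correspond to the discarded questions in the caption, and the sentinel pool is taken before any discrimination filtering so that the hardest questions are kept for headroom.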

Theorems & Definitions (2)

  • Definition 1.1: Document Collection
  • Definition 1.2: Agentic Document Collection VQA