Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding

Keliang Liu, Zizhi Chen, Mingcheng Li, Jingqun Tang, Dingkang Yang, Lihua Zhang

TL;DR

SLEUTH addresses long-document understanding by engineering high-quality context rather than expanding model capacity. It introduces a training-free, four-agent system that narrows retrieval, extracts structured textual and visual clues, filters visual noise, and adapts reasoning strategy to query difficulty, producing a concise evidence-dense multimodal context for final answers. Across four benchmarks, SLEUTH achieves state-of-the-art or best-in-class accuracy, with ablations validating the contribution of each agent and the benefits of a visual, page-grounded evidence representation. The work highlights the importance of context quality in multimodal long-document QA and offers a generally applicable, backbone-agnostic framework that enhances robustness and interpretability. Future work will address retriever dependence, enable learning-based adaptation, and extend to multilingual or handwritten documents.

Abstract

Document understanding is a long-standing practical task. Vision-Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single-page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the model's judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence-dense multimodal context to generate the final prediction. SLEUTH is model-agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving state-of-the-art results. Ablation studies verify each module's effectiveness and confirm the benefits of our hierarchical refinement paradigm.

Paper Structure

This paper contains 23 sections, 10 equations, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: Comparison with mainstream methods. (a) Strengthening reasoning via agent optimization; (b) Improving recall through retrieval augmentation; (c) Combining (a) and (b); (d) Our method focuses on constructing evidence-dense contexts.
  • Figure 2: Overall Framework. SLEUTH adopts a coarse-to-fine pipeline: (1) a visual retriever selects the Top-$K$ pages; (2) the Clue Discovery Agent records and refines evidence, while the Page Screening Agent filters irrelevant page images; (3) the Difficulty Assessment Agent analyzes query complexity; and (4) the Core Decision Agent reasons over the distilled, evidence-dense context.
  • Figure 3: Performance of SLEUTH compared with closed-source commercial models such as Gemini 2.5 Pro [comanici2025gemini] and GPT-5 [openai_gpt5_system_card_2025] on MMLongBench-Doc [ma2024mmlongbench].
  • Figure 4: Case study. Compared with basic methods that rely solely on direct input of the top-5 retrieved pages, SLEUTH performs dynamic correction through multi-step evidence recording and page-wise filtering, effectively preventing the hallucination accumulation caused by multimodal long-context inputs with complex layouts.
  • Figure 5: Baseline comparison on MMLongBench-Doc. Our method yields a larger polygon across dimensions, consistent with compact, page-grounded evidence contexts.
  • ...and 4 more figures
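
The coarse-to-fine pipeline described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Page`, `Context`, `retrieve_top_k`, `discover_clues`, `screen_pages`, `assess_difficulty`, `build_context`) is hypothetical, and each agent is replaced by a toy word-overlap heuristic, where the paper uses a visual retriever and VLM-backed agents. The sketch only shows how the stages hand a progressively smaller, denser context to the final decision step.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Page:
    number: int
    text: str           # stand-in for a page image plus its text
    score: float = 0.0  # hypothetical retriever relevance score


@dataclass
class Context:
    query: str
    clues: List[str] = field(default_factory=list)
    pages: List[Page] = field(default_factory=list)
    difficulty: str = "single-page"


def _overlap(query: str, text: str) -> int:
    """Toy relevance: number of distinct query words found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_top_k(query: str, pages: Sequence[Page], k: int) -> List[Page]:
    """Stage 1 (assumed stand-in for the visual retriever): keep the k best pages."""
    scored = [Page(p.number, p.text, float(_overlap(query, p.text))) for p in pages]
    return sorted(scored, key=lambda p: p.score, reverse=True)[:k]


def discover_clues(query: str, pages: Sequence[Page]) -> List[str]:
    """Stage 2a (toy Clue Discovery Agent): record one clue per relevant page."""
    return [f"p{p.number}: {p.text}" for p in pages if _overlap(query, p.text) > 0]


def screen_pages(pages: Sequence[Page]) -> List[Page]:
    """Stage 2b (toy Page Screening Agent): drop pages with no relevance signal."""
    return [p for p in pages if p.score > 0]


def assess_difficulty(clues: Sequence[str]) -> str:
    """Stage 3 (toy Difficulty Assessment Agent): more clue pages, harder query."""
    return "multi-page" if len(clues) > 1 else "single-page"


def build_context(query: str, pages: Sequence[Page], k: int = 3) -> Context:
    """Assemble the distilled context that a Core Decision Agent would reason over."""
    top = retrieve_top_k(query, pages, k)
    clues = discover_clues(query, top)
    kept = screen_pages(top)
    return Context(query, clues, kept, assess_difficulty(clues))


if __name__ == "__main__":
    doc = [
        Page(1, "revenue table for 2023"),
        Page(2, "company history"),
        Page(3, "revenue chart by region"),
    ]
    ctx = build_context("revenue 2023", doc, k=3)
    print([p.number for p in ctx.pages], ctx.difficulty, ctx.clues)
```

In a real system, the final step (stage 4) would pass `ctx.clues`, the retained page images, and a strategy chosen from `ctx.difficulty` to a VLM; that call is deliberately left out because the source does not specify its interface.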