ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents

Tianyu Yang, Terry Ruas, Yijun Tian, Jan Philip Wahle, Daniel Kurzawe, Bela Gipp

TL;DR

ALDEN tackles visually rich document understanding (VRDU) on long documents by training vision-language models (VLMs) as autonomous agents in a multi-turn reinforcement learning framework. It introduces an expanded action space with a fetch operation, a cross-level reward for fine-grained supervision, and a visual-semantic anchoring mechanism that stabilizes training when long documents contribute many visual tokens. Trained on a corpus built from three VRDU datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks, demonstrating robust evidence gathering and adaptive navigation. The work shifts from passive document reading to active, strategic navigation and reasoning over long multimodal documents, with practical implications for scalable VRDU systems.

Abstract

Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses a page by its index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.

Paper Structure

This paper contains 24 sections, 13 equations, 3 figures, 9 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the rollout process. At each turn: (1) the VLM generates a response conditioned on the dialogue history; (2) the response is parsed into an action (search, fetch, or answer); (3) the action is executed, where search or fetch collect document pages and answer terminates the process; and (4) the cross-level reward function assigns rewards based on execution outcomes and parsing results.
  • Figure 2: Overview of RL training in ALDEN. The policy model generates multi-turn trajectories, which are scored by a cross-level reward function and a value model. Turn-level GAE integrates future rewards to update the cross-level reward, and token-level GAE produces advantages for policy updates. A reference model supplies logits for both generated and visual tokens, which the visual semantic anchoring mechanism uses to constrain hidden-state evolution during optimization.
  • Figure 3: Training dynamics of ALDEN with and without Visual Semantic Anchoring (VSA). Panel (a) shows the turn-level reward of the answer action, panel (b) shows token-level entropy, and panels (c) and (d) plot the KL divergence of visual tokens and generated tokens, respectively.
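
The four rollout steps in the Figure 1 caption (generate, parse, execute, reward) can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the `<search>`/`<fetch>`/`<answer>` tag syntax, the word-overlap retriever, the 1-based page indexing, and the names `Action`, `parse_action`, `execute`, and `rollout` are all hypothetical. The cross-level reward is deliberately left out, since its rules are not given here; the loop only records the parsing and execution outcomes that such a reward would score.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str      # "search", "fetch", "answer", or "invalid"
    argument: str  # query text, page index, or final answer


# Hypothetical tag format; the actual prompt/response template may differ.
ACTION_PATTERN = re.compile(r"<(search|fetch|answer)>(.*?)</\1>", re.DOTALL)


def parse_action(response: str) -> Action:
    """Step (2): parse a model response into an action."""
    match = ACTION_PATTERN.search(response)
    if match is None:
        return Action("invalid", "")
    return Action(match.group(1), match.group(2).strip())


def execute(action: Action, pages: List[str], top_k: int = 1) -> List[int]:
    """Step (3): return the 1-based indices of the pages an action collects."""
    if action.kind == "fetch":
        # fetch: direct access to a page by its index
        try:
            index = int(action.argument)
        except ValueError:
            return []
        return [index] if 1 <= index <= len(pages) else []
    if action.kind == "search":
        # search: toy retriever that ranks pages by word overlap with the query
        query = set(action.argument.lower().split())
        scored = [(len(query & set(text.lower().split())), i + 1)
                  for i, text in enumerate(pages)]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [index for _, index in scored[:top_k]]
    return []  # "answer" and "invalid" collect nothing


def rollout(policy: Callable[[List[str]], str], pages: List[str],
            question: str, max_turns: int = 4) -> List[Tuple[Action, List[int]]]:
    """Run one multi-turn episode and record (action, collected pages) per turn."""
    history = [question]
    trace = []
    for _ in range(max_turns):
        response = policy(history)          # (1) generate from dialogue history
        action = parse_action(response)     # (2) parse into an action
        collected = execute(action, pages)  # (3) execute the action
        trace.append((action, collected))   # (4) outcomes a reward would score
        if action.kind == "answer":         # answer terminates the process
            break
        history.append(response)
        history.extend(pages[i - 1] for i in collected)
    return trace
```

In a real setup, `policy` would wrap a VLM and the pages would be images rather than strings; a scripted policy that returns fixed responses is enough to exercise the control flow.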