DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding

Yuchuan Wu; Minghan Zhuo; Teng Fu; Mengyang Zhao; Bin Li; Xiangyang Xue

TL;DR

DocCogito is a unified framework that integrates global layout perception with structured, region-grounded reasoning. It introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain that supervises fine-grained intermediate reasoning aligned with evidence regions.

Abstract

Document understanding with multimodal large language models (MLLMs) requires not only accurate answers but also explicit, evidence-grounded reasoning, especially in high-stakes scenarios. However, current document MLLMs still fall short of a complete, human-like reasoning process: even when they improve both layout encoding and chain-of-thought (CoT) prompting, the interaction between the two is typically learned implicitly and remains loosely coupled rather than being enforced as a systematic mechanism. We therefore propose DocCogito, a unified framework that integrates global layout perception with structured, region-grounded reasoning. DocCogito introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain (VSC), a concise structured representation that is less ambiguous than free-form natural-language CoT, which supervises fine-grained intermediate reasoning aligned with evidence regions. Training follows a progressive recipe: layout perception pretraining, VSC-guided cold start, rejection sampling, and Group Relative Policy Optimization (GRPO). To further strengthen the internal coupling between layout priors and VSC execution, we augment standard rewards with a fine-grained region-confidence signal that encourages reasoning traces to stay aligned with their corresponding evidence regions. Extensive experiments on six benchmarks (DocVQA, WTQ, ChartQA, TextVQA, OCRBench, and InfoVQA) demonstrate strong generalization, with state-of-the-art results on four of them.
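The reward augmentation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the reward formula, so everything specific here is an assumption made for illustration: the function names `combined_reward` and `group_relative_advantages`, the additive combination of answer, format, and region-confidence terms, and the weight `0.5` are all hypothetical. The only part taken from GRPO as generally described is the normalization of each reward against the mean and standard deviation of its sampled group.

```python
from statistics import mean, pstdev


def combined_reward(answer_correct, format_ok, region_confidences, weight=0.5):
    """Hypothetical reward: standard answer/format terms plus a weighted
    mean of per-step region confidences (assumed form, not the paper's)."""
    base = float(answer_correct) + float(format_ok)
    region_term = mean(region_confidences) if region_confidences else 0.0
    return base + weight * region_term


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses sampled for the same question:
# (answer_correct, format_ok, per-step region confidences)
samples = [
    (True, True, [0.9, 0.8]),   # correct, well grounded
    (True, True, [0.3, 0.2]),   # correct, poorly grounded
    (False, True, [0.7]),       # wrong answer, valid format
    (False, False, []),         # wrong answer, invalid format
]
rewards = [combined_reward(*s) for s in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, two responses with the same correct answer receive different advantages when their reasoning steps are grounded with different confidence, which is the kind of pressure the region-confidence signal is meant to apply.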

Paper Structure (19 sections, 13 equations, 5 figures, 4 tables)

This paper contains 19 sections, 13 equations, 5 figures, 4 tables.

Introduction
Related Work
Document Understanding
Reinforcement Learning
Methodology
Dataset Construction
Visual-Semantic Chain (VSC)
Model Architecture
Training Recipe
Layout Perception Pretraining
Multi-stage Post-training
Reward Formulation
Experiments
Datasets and Metrics
Implementation Details
...and 4 more sections

Figures (5)

Figure 1: Overview of reasoning pipelines in document understanding models. (a) Traditional text-only CoT. (b) DocLayLLM with OCR-based text/boxes and multiple CoT templates. (c) LayoutLLM with OCR-based text/boxes and staged reasoning (question analysis $\rightarrow$ area concentration $\rightarrow$ answer formation). (d) Ours, an OCR-free approach that integrates global layout perception with structured, region-grounded reasoning. VSC means Visual-Semantic Chain.
Figure 2: Overview of our methods. (a) The model architecture, where a lightweight layout tower extracts global layout cues and injects a [LAYOUT] token into the LLM. (b) The two-stage training recipe, consisting of layout perception pretraining and multi-stage GRPO.
Figure 3: OCRBench subtask breakdown for the top three overall models (Marten, Mini-Monkey, and DocCogito-8B) under the official evaluation protocol.
Figure 4: A VSC-style CoT example showing question analysis, region grounding, and operator-level reasoning steps that lead to the final answer.
Figure 5: Qualitative examples of our VSC-style CoT across diverse document types.
