SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar

TL;DR

SlideAgent tackles the challenge of fine-grained understanding of multi-page visual documents by introducing a hierarchical agentic framework with global, page, and element levels. It builds a query-agnostic knowledge base $\mathcal{K} = \{\mathcal{K}_g, \mathcal{K}_p, \mathcal{K}_e\}$ during a Knowledge Construction stage, then performs retrieval-augmented reasoning to synthesize context-aware answers via level-specific reasoning outputs ($h_g, h_p, h_e$). Across SlideVQA, TechSlides, and FinSlides, SlideAgent yields substantial gains over both proprietary and open-source baselines (+7.9 and +9.8 respectively), driven by effective page-level grounding and element-level visual-textual grounding. The approach demonstrates robustness across backbones and query types, underscores the importance of page-level reasoning, and points toward scalable, metadata-free visual document understanding in domains like finance and education. Overall, SlideAgent provides interpretable, spatially grounded reasoning and broad applicability to multi-page presentations and infographics.

Abstract

Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvements over both proprietary (+7.9 overall) and open-source models (+9.8 overall).

Paper Structure

This paper contains 59 sections, 6 equations, 9 figures, 12 tables, 1 algorithm.

Figures (9)

  • Figure 1: When given the full page, the LLM miscounts the number of product mix categories. After isolating the chart, it correctly identifies all eight categories, highlighting the importance of accurate element parsing.
  • Figure 2: SlideAgent generates knowledge about input slide decks in a hierarchical manner at three levels: global, page, and element. At each level, specialized agents generate query-agnostic knowledge during knowledge construction, then retrieve and reason over query-specific textual and visual knowledge during the inference stage. Sample knowledge $\mathcal{K}$ generated by SlideAgent is shown in Appendix Figures \ref{fig:sample_global_knowledge}, \ref{fig:sample_page_knowledge}, and \ref{fig:sample_element_knowledge}, and answers generated by the agents are in Figure \ref{fig:sample_answer}.
  • Figure 3: Performance comparison among variants of SlideAgent with base model GPT-4o.
  • Figure 4: Accuracy of SlideAgent and base models (GPT-4o / InternVL3-8B) on different query types.
  • Figure 5: Example answers generated by SlideAgent. Agents at different levels work together to provide comprehensive responses.
  • ...and 4 more figures