Survey on Question Answering over Visually Rich Documents: Methods, Challenges, and Trends
Camille Barboule, Benjamin Piwowarski, Yoan Chabot
TL;DR
The paper surveys visually-rich document understanding (VrDU), detailing how visually-rich documents (VrDs) are encoded from text, layout, and visual features and how large language models (LLMs) can decode these multimodal representations for tasks such as question answering. It contrasts structured-encoding approaches, which integrate text (T), layout (L), and visual (V) features through 2D positional cues and cross-modal fusion, with vision-only strategies that treat VrDs as images and rely on large vision-language models (LVLMs), highlighting trade-offs in scalability, efficiency, and accuracy, especially for multi-page documents. The authors synthesize techniques for integrating visual features into LLMs (self-attention vs. cross-attention) and discuss pretraining strategies, concluding that cross-modal fusion with robust 2D position encoding and sparse or hierarchical attention holds the most promise for practical VrDU. They also identify gaps in evaluation consistency and domain coverage, calling for standardized benchmarks and for expansion beyond visual question answering (VQA) to real-world document tasks. Overall, the survey provides a taxonomy and guidance for advancing VrDU across structured, vision-only, and multi-page regimes, with practical implications for document analysis pipelines and AI-assisted reading systems.
Abstract
The field of visually-rich document understanding, which involves interacting with visually-rich documents (whether scanned or born-digital), is rapidly evolving and still lacks consensus on several key aspects of the processing pipeline. In this work, we provide a comprehensive overview of state-of-the-art approaches, emphasizing their strengths and limitations, pointing out the main challenges in the field, and proposing promising research directions.
