Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism
Lei Kang, Rubèn Tito, Ernest Valveny, Dimosthenis Karatzas
TL;DR
This work tackles multi-page Document VQA under OCR-free constraints by leveraging Pix2Struct as a unified visualRepresentation backbone. A two-stage training scheme first optimizes a single-page VQA model, then freezes the encoder and trains a self-attention scoring module to retrieve the most relevant page(s) for a given question, enabling scalable evaluation across documents with many pages. The method achieves state-of-the-art-like page prediction while maintaining competitive ANLS scores, without relying on OCR and with fewer parameters than prior multi-page baselines, and demonstrates robustness to documents containing hundreds of pages. The authors provide an extended evaluation on full-length documents (up to 793 pages) and release their code, highlighting practical gains for real-world document understanding tasks.
Abstract
Documents are 2-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach where textual and visual information are efficiently combined. Document Visual Question Answering (Document VQA), due to this multi-modal nature, has garnered significant interest from both the document understanding and natural language processing communities. The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle. They have to concatenate all pages into one large page for processing, demanding substantial GPU resources, even for evaluation. In this work, we propose a novel method and efficient training strategy for multi-page Document VQA tasks. In particular, we employ a visual-only document representation, leveraging the encoder from a document understanding model, Pix2Struct. Our approach utilizes a self-attention scoring mechanism to generate relevance scores for each document page, enabling the retrieval of pertinent pages. This adaptation allows us to extend single-page Document VQA models to multi-page scenarios without constraints on the number of pages during evaluation, all with minimal demand for GPU resources. Our extensive experiments demonstrate not only achieving state-of-the-art performance without the need for Optical Character Recognition (OCR), but also sustained performance in scenarios extending to documents of nearly 800 pages compared to a maximum of 20 pages in the MP-DocVQA dataset. Our code is publicly available at \url{https://github.com/leitro/SelfAttnScoring-MPDocVQA}.
