Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism

Lei Kang, Rubèn Tito, Ernest Valveny, Dimosthenis Karatzas

TL;DR

This work tackles multi-page Document VQA under OCR-free constraints by using Pix2Struct as a unified visual representation backbone. A two-stage training scheme first optimizes a single-page VQA model, then freezes the encoder and trains a self-attention scoring module to retrieve the most relevant page(s) for a given question, which allows evaluation to scale to documents with many pages. The method reaches page prediction accuracy comparable to the state of the art while maintaining competitive ANLS scores, without relying on OCR and with fewer parameters than prior multi-page baselines, and it remains robust on documents containing hundreds of pages. The authors provide an extended evaluation on full-length documents (up to 793 pages) and release their code.
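For readers unfamiliar with the ANLS metric mentioned above, the following is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined for Document VQA benchmarks. The function names, the lowercasing and whitespace stripping, and the default threshold of 0.5 are assumptions based on common practice, not details taken from this paper or its code release.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best similarity to any reference answer.

    A similarity counts only if the normalized edit distance is below
    `threshold`; otherwise that answer scores 0.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)
```

A prediction that differs from the reference by one character out of five scores 0.8, while a completely different answer scores 0, so the metric tolerates small recognition errors but not wrong answers.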

Abstract

Documents are 2-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach where textual and visual information are efficiently combined. Document Visual Question Answering (Document VQA), due to this multi-modal nature, has garnered significant interest from both the document understanding and natural language processing communities. The state-of-the-art single-page Document VQA methods show impressive performance, yet they struggle in multi-page scenarios: they have to concatenate all pages into one large page for processing, which demands substantial GPU resources even for evaluation. In this work, we propose a novel method and efficient training strategy for multi-page Document VQA tasks. In particular, we employ a visual-only document representation, leveraging the encoder from a document understanding model, Pix2Struct. Our approach utilizes a self-attention scoring mechanism to generate relevance scores for each document page, enabling the retrieval of pertinent pages. This adaptation allows us to extend single-page Document VQA models to multi-page scenarios without constraints on the number of pages during evaluation, all with minimal demand for GPU resources. Our extensive experiments demonstrate not only state-of-the-art performance without the need for Optical Character Recognition (OCR), but also sustained performance on documents of nearly 800 pages, compared to a maximum of 20 pages in the MP-DocVQA dataset. Our code is publicly available at https://github.com/leitro/SelfAttnScoring-MPDocVQA.
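The page-retrieval idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the token features stand in for the output of a frozen visual encoder, the attention uses identity query/key/value projections, and `page_score`, `retrieve_page`, `w`, and `b` are hypothetical names chosen for illustration. The point it shows is that each page is scored independently, so memory use does not grow with the number of pages.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, to keep the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out


def page_score(tokens: List[Vector], w: Vector, b: float = 0.0) -> float:
    """Relevance score in (0, 1): attend, mean-pool, project, sigmoid."""
    attended = self_attention(tokens)
    d = len(w)
    pooled = [sum(t[i] for t in attended) / len(attended) for i in range(d)]
    return 1.0 / (1.0 + math.exp(-(dot(pooled, w) + b)))


def retrieve_page(pages: Iterable[List[Vector]], w: Vector,
                  b: float = 0.0) -> Tuple[int, float]:
    """Score pages one at a time and return (best index, best score)."""
    best_idx, best_score = -1, -1.0
    for idx, tokens in enumerate(pages):
        score = page_score(tokens, w, b)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score
```

Because `retrieve_page` accepts any iterable, pages can be streamed from disk or produced by a generator, and only the top-scoring page would then be passed to the single-page answering model.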

Paper Structure (13 sections, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Transforming natural language questions into the visual modality. Note that the blue color serves only to highlight the text within the image; it is not part of the final concatenated image.
  • Figure 2: Overview of the proposed framework. The training process involves two steps: initially training the single-page model in the upper green block with positive question-page pairs only, then freezing the single-page model and training the self-attention scoring module in the red block with positive and negative question-page pairs to retrieve the most relevant document pages. The evaluation is indicated by blue arrows.
  • Figure 3: The architecture of Self-Attention Scoring Module.
  • Figure 4: Histograms of the number of pages per document in the original test set and in the extended version with full pages, shown in green and blue, respectively.
  • Figure 5: Heatmap of page prediction accuracy (%) on the validation set.
  • ...and 4 more figures