GRAM: Global Reasoning for Multi-Page VQA

Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman

TL;DR

This work presents GRAM, a method that extends pre-trained single-page models to the multi-page setting without computationally heavy pretraining, and proposes a tailored bias adaptation method that pushes the model to make use of the newly introduced document tokens.

Abstract

The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting, without requiring computationally heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with designated document-level layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To push the model to utilize the newly introduced document tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our compression transformer (C-Former), which reduces the encoded sequence length and thereby allows a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on multi-page DocVQA benchmarks, demonstrating the effectiveness of our approach.
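
To give a rough sense of why splitting attention by page helps with long documents, the sketch below counts query-key pairs for two layouts: one self-attention over the whole concatenated document, versus per-page attention followed by attention over the document tokens only (the split described in the Figure 2 and Figure 3 captions). The function names and the token counts (512 page tokens, 16 doc tokens) are illustrative assumptions, not values from the paper, and the count ignores feed-forward layers, the decoder, and the optional C-Former.

```python
def joint_attention_pairs(num_pages, page_tokens, doc_tokens):
    """Query-key pairs when a single self-attention spans the whole document."""
    total = num_pages * (page_tokens + doc_tokens)
    return total * total


def two_stage_attention_pairs(num_pages, page_tokens, doc_tokens):
    """Pairs for per-page attention plus attention over doc tokens only."""
    # Page level: each page attends only within its own tokens.
    page_level = num_pages * (page_tokens + doc_tokens) ** 2
    # Document level: only the doc tokens of all pages attend to each other.
    doc_level = (num_pages * doc_tokens) ** 2
    return page_level + doc_level


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the trend as pages grow.
    for pages in (1, 10, 100):
        joint = joint_attention_pairs(pages, 512, 16)
        staged = two_stage_attention_pairs(pages, 512, 16)
        print(f"pages={pages:>3}  joint={joint:>14,}  two-stage={staged:>14,}")
```

Under these assumptions the joint count grows quadratically in the number of pages, while the two-stage count is dominated by a term linear in the number of pages as long as the doc tokens per page stay few.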

Paper Structure (23 sections, 6 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: An overview of GRAM. We suggest an interleaved encoder architecture that combines page-level and document-level attention layers, allowing information to propagate between different pages. An optional compression transformer (C-Former) is introduced to allow a trade-off between quality and latency.
  • Figure 2: GRAM architecture. (a) Depicts a high-level architecture overview. For each page, the visual, textual and question tokens are concatenated together with learnable doc tokens (darker color shade). The processed information is fed into the multi-page encoder. The encoder output can be fed directly into the decoder to create the final prediction. Optionally, a compression model, C-Former, can be used between the encoder and the decoder to compress the encoder output into a predetermined length, thus reducing overall latency for long documents. (b) Shows a global-local encoder layer, containing two sub-layers. The first sub-layer applies self-attention to each page separately, while the second applies self-attention to the doc tokens to fuse information between the different pages. The doc tokens are then routed back to their respective pages and passed to the next global-local encoder layer.
  • Figure 3: Global-local attention. In long-sequence approaches (a), attention is applied jointly to the entire sequence of concatenated local and global tokens. Our method separates the computation into two steps, page-level (b) and document-level (c), leveraging the natural division of documents into pages.
  • Figure 4: A qualitative comparison between our approach and Hi-VT5 (Tito et al., 2022) indicates that integrating our global-local encoder enhances reasoning capabilities, especially when the questions require multi-page context.
  • Figure 5: Latency comparison. We compare how overall latency depends on the number of pages in the input document for GRAM, $\text{GRAM}_{\text{C-Former}}$, $\text{DocFormerv2}_{\text{concat}}$ and Hi-VT5.
  • ...and 4 more figures