LiGT: Layout-infused Generative Transformer for Visual Question Answering on Vietnamese Receipts

Thanh-Phong Le, Trung Le Chi Phan, Nghia Hieu Nguyen, Kiet Van Nguyen

TL;DR

This work introduces ReceiptVQA, the first large-scale Vietnamese document VQA dataset for receipts, and LiGT, a layout-infused generative transformer that integrates 2D layout hashing with pretrained language-model embeddings. LiGT uses LayoutHEI to encode spatial information without heavy modular additions, and relies on ViT5 as a Vietnamese-capable backbone to generate answers. Experiments show generative models outperform extractive baselines, and that multimodal integration (text, layout, and visuals) improves performance on the receipt domain; LiGT achieves competitive results against strong baselines and demonstrates transferability to other document VQA benchmarks. The study highlights the importance of multilingual, layout-aware, and generative approaches for Vietnamese document understanding and provides a foundation for further research in low-resource language document VQA.

Abstract

Document Visual Question Answering (Document VQA) challenges multimodal systems to holistically handle textual, layout, and visual modalities to provide appropriate answers. Document VQA has gained popularity in recent years due to the increasing number of documents and the high demand for digitization. Nonetheless, most document VQA datasets are developed in high-resource languages such as English. In this paper, we present ReceiptVQA (Receipt Visual Question Answering), the first large-scale document VQA dataset in Vietnamese dedicated to receipts, a document type with high commercial potential. The dataset encompasses 9,000+ receipt images and 60,000+ manually annotated question-answer pairs. In addition, we introduce LiGT (Layout-infused Generative Transformer), a layout-aware encoder-decoder architecture designed to leverage the embedding layers of language models to handle layout embeddings, minimizing the use of additional neural modules. Experiments on ReceiptVQA show that our architecture yields promising performance, achieving competitive results compared with strong baselines. Furthermore, in analyzing the experimental results, we found clear evidence that encoder-only model architectures are at a considerable disadvantage compared with architectures that can generate answers. We also observed that combining multiple modalities is necessary to tackle our dataset, despite the critical role of semantic understanding from language models. We hope that our work will encourage and facilitate future development in Vietnamese document VQA, contributing to a diverse multimodal research community in the Vietnamese language.

Paper Structure

This paper contains 46 sections, 8 equations, 17 figures, 16 tables.

Figures (17)

  • Figure 1: Examples of ReceiptVQA samples
  • Figure 2: Overview of ReceiptVQA dataset creation
  • Figure 3: Number of questions with a particular question length
  • Figure 4: Number of answers with a particular answer length
  • Figure 5: Number of annotations in each question type
  • ...and 12 more figures