Synthetic Document Question Answering in Hungarian

Jonathan Li, Zoltan Csaki, Nidhi Hiremath, Etash Guha, Fenglu Hong, Edward Ma, Urmish Thakker

TL;DR

This work tackles the scarcity of Hungarian DocVQA data by introducing scalable, multilingual data curation that leverages LLMs to create three Hungarian multimodal datasets: HuDocVQA-manual (human-verified), HuDocVQA (synthetic), and HuCCPDF (OCR-focused). The authors demonstrate that current VLMs underperform on Hungarian DocVQA compared with English and show that finetuning with a mixture of HuDocVQA and HuCCPDF data yields substantial gains (up to +7.2% on HuDocVQA with Llama 3.2 11B Instruct). A careful quality-filtering regime and model-merging strategies underpin improved data quality and performance, and the datasets/code will be released to accelerate multilingual DocVQA research. Overall, the approach provides a scalable blueprint for constructing high-quality document QA datasets in low-resource languages and can be generalized beyond Hungarian to other languages with limited data.

Abstract

Modern VLMs have achieved near-saturation accuracy in English document visual question-answering (VQA). However, this task remains challenging in lower-resource languages due to a dearth of suitable training and evaluation data. In this paper we present scalable methods for curating such datasets by focusing on Hungarian, approximately the 17th highest resource language on the internet. Specifically, we present HuDocVQA and HuDocVQA-manual, document VQA datasets that modern VLMs significantly underperform on compared to English DocVQA. HuDocVQA-manual is a small manually curated dataset based on Hungarian documents from Common Crawl, while HuDocVQA is a larger synthetically generated VQA dataset from the same source. We apply multiple rounds of quality filtering and deduplication to HuDocVQA in order to match human-level quality in this dataset. We also present HuCCPDF, a dataset of 117k pages from Hungarian Common Crawl PDFs along with their transcriptions, which can be used for training a model for Hungarian OCR. To validate the quality of our datasets, we show how finetuning on a mixture of these datasets can improve accuracy on HuDocVQA for Llama 3.2 11B Instruct by +7.2%. Our datasets and code will be released to the public to foster further research in multilingual DocVQA.

Paper Structure

This paper contains 26 sections, 5 figures, and 10 tables.

Figures (5)

  • Figure 1: An example question-answer pair from HuDocVQA-manual
  • Figure 2: A diagram of our synthetic data pipeline
  • Figure 3: Examples from HuDocVQA-manual where ANLS fails to correctly measure model response accuracy
  • Figure 4: Examples from HuDocVQA-manual where GPT-4o would rate models more harshly than Llama 3.1 405B Instruct
  • Figure 5: The Hungarian system prompt we provide to Llama 3.3 70B Instruct for synthetic QA generation. English translation provided in italics.
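
The Figure 3 caption refers to ANLS (Average Normalized Levenshtein Similarity), a string-overlap metric commonly used for DocVQA. As a rough illustration of why such a metric can misjudge answers, here is a minimal sketch of the standard formulation. The 0.5 threshold, the lowercasing, and the whitespace stripping are assumptions taken from common DocVQA practice, not details confirmed by this paper, and the function names are hypothetical.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls_single(prediction: str, references: List[str],
                threshold: float = 0.5) -> float:
    """Score one prediction against all acceptable answers; keep the best.

    Assumption: answers are compared case-insensitively after stripping
    surrounding whitespace, and the threshold is 0.5.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        gold = ref.strip().lower()
        longest = max(len(pred), len(gold))
        nl = 0.0 if longest == 0 else levenshtein(pred, gold) / longest
        # Answers too far from the reference get no partial credit.
        score = 1.0 - nl if nl < threshold else 0.0
        best = max(best, score)
    return best


def anls(predictions: List[str], references: List[List[str]],
         threshold: float = 0.5) -> float:
    """Average the per-question scores over the whole evaluation set."""
    if not predictions:
        return 0.0
    scores = [anls_single(p, r, threshold)
              for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)
```

Because the score depends only on character-level overlap, a correct answer phrased differently from the reference (for example, a full sentence instead of a short span) can fall past the threshold and score 0. That is presumably the kind of failure the Figure 3 examples show.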