Synthetic Document Question Answering in Hungarian
Jonathan Li, Zoltan Csaki, Nidhi Hiremath, Etash Guha, Fenglu Hong, Edward Ma, Urmish Thakker
TL;DR
This work tackles the scarcity of Hungarian DocVQA data by introducing scalable, LLM-assisted data curation methods and three Hungarian multimodal datasets: HuDocVQA-manual (manually curated), HuDocVQA (synthetic), and HuCCPDF (OCR-focused). The authors show that current VLMs perform worse on Hungarian DocVQA than on English DocVQA, and that finetuning on a mixture of HuDocVQA and HuCCPDF data yields substantial gains (up to +7.2% on HuDocVQA with Llama 3.2 11B Instruct). Careful quality filtering and model-merging strategies underpin the improved data quality and performance, and the datasets and code will be released to accelerate multilingual DocVQA research. Overall, the approach offers a scalable blueprint for constructing high-quality document QA datasets in low-resource languages beyond Hungarian.
Abstract
Modern VLMs have achieved near-saturation accuracy in English document visual question answering (VQA). However, this task remains challenging in lower-resource languages due to a dearth of suitable training and evaluation data. In this paper, we present scalable methods for curating such datasets, focusing on Hungarian, approximately the 17th highest-resource language on the internet. Specifically, we present HuDocVQA and HuDocVQA-manual, document VQA datasets on which modern VLMs significantly underperform relative to English DocVQA. HuDocVQA-manual is a small, manually curated dataset based on Hungarian documents from Common Crawl, while HuDocVQA is a larger, synthetically generated VQA dataset from the same source. We apply multiple rounds of quality filtering and deduplication to HuDocVQA to bring this dataset to human-level quality. We also present HuCCPDF, a dataset of 117k pages from Hungarian Common Crawl PDFs along with their transcriptions, which can be used to train models for Hungarian OCR. To validate the quality of our datasets, we show that finetuning on a mixture of them improves the accuracy of Llama 3.2 11B Instruct on HuDocVQA by +7.2%. Our datasets and code will be released to the public to foster further research in multilingual DocVQA.
