HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images
Yuchen Yang, Haoran Yan, Yanhao Chen, Qingqiang Wu, Qingqi Hong
TL;DR
Current vision-language QA systems struggle to interpret human annotations in text-heavy images, which limits their real-world applicability. The authors define the HAUR task and curate the HAUR-5 dataset, which covers five annotation styles. They also introduce OCR-Mix, a Pix2Struct-based architecture that fuses OCR-extracted text with image features via multi-layer cross-attention to generate text outputs. OCR-Mix achieves state-of-the-art results on HAUR-5, with ANLS > 0.95 and ACC > 90% in most cases, and remains robust in ablation tests and on real-world (in-the-wild) data. The work provides a practical resource and methodology for improving human-annotation understanding in multimodal systems, and OCR-Mix can serve as an external tool that improves vision-language models on real-world QA tasks.
Abstract
Visual Question Answering (VQA) tasks use images to convey the critical information needed to answer text-based questions, and they are among the most common forms of question answering in real-world scenarios. Numerous vision-text models exist today and perform well on certain VQA tasks. However, these models exhibit significant limitations in understanding human annotations on text-heavy images. To address this, we propose the Human Annotation Understanding and Recognition (HAUR) task. As part of this effort, we introduce the Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which encompasses five common types of human annotations. Additionally, we develop and train our model, OCR-Mix. Through comprehensive cross-model comparisons, our results demonstrate that OCR-Mix outperforms other models on this task. Our dataset and model will be released soon.
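The TL;DR reports results in ANLS (Average Normalized Levenshtein Similarity). As a reading aid, the following is a minimal sketch of how that metric is commonly computed, assuming the usual definition with a threshold of 0.5 and case-insensitive, whitespace-trimmed comparison. The function names `levenshtein` and `anls` are illustrative, and the exact normalization used for HAUR-5 is not stated in this section, so the authors' evaluation may differ in detail.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         tau: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity.

    Assumption: tau=0.5 and lowercase/strip normalization, as in the
    commonly used ANLS definition; not confirmed for HAUR-5.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - nl if nl < tau else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under this definition an exact match scores 1, a near miss scores slightly below 1, and an answer whose normalized edit distance reaches the threshold scores 0, so an average above 0.95 means most predictions are exact or nearly exact.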
