HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images

Yuchen Yang, Haoran Yan, Yanhao Chen, Qingqiang Wu, Qingqi Hong

TL;DR

Current vision-language QA systems struggle to interpret human annotations in text-heavy images, which limits their real-world applicability. The authors define the HAUR task and curate the HAUR-5 dataset, which covers five annotation styles. They also introduce OCR-Mix, a Pix2Struct-based architecture that fuses OCR-extracted text with image features via multi-layer cross-attention to generate text outputs. OCR-Mix achieves state-of-the-art results on HAUR-5, with ANLS > 0.95 and ACC > 90% in most cases, and remains robust in ablation tests and on real-world (in-the-wild) data. The work provides a practical resource and methodology for improving human-annotation understanding in multimodal systems, and OCR-Mix can serve as an external tool to improve VL models on real-world QA tasks.
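
The ANLS figure quoted above can be made concrete with a small sketch. This page does not define the metric, so the code below assumes the commonly used definition of Average Normalized Levenshtein Similarity: each prediction is scored as 1 minus its normalized edit distance to the closest reference answer, scores whose normalized distance reaches a threshold (0.5 here) are set to 0, and the results are averaged. The function names, the lowercasing and whitespace stripping, and the threshold value are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity (assumed definition).

    `references[i]` holds the acceptable answers for `predictions[i]`.
    The threshold of 0.5 is an assumption, not a value from the paper.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            dist = levenshtein(p, r) / longest if longest else 0.0
            # Answers that are too far from the reference earn no credit.
            score = 1.0 - dist if dist < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under this definition, a prediction of "helo" against the reference "hello" has an edit distance of 1 over a length of 5, so it scores 0.8 rather than 0. This tolerance for small OCR-style errors is why ANLS is often reported alongside a stricter accuracy figure.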

Abstract

Vision Question Answering (VQA) tasks use images to convey the critical information needed to answer text-based questions, and they are one of the most common forms of question answering in real-world scenarios. Numerous vision-text models exist today and have performed well on certain VQA tasks. However, these models exhibit significant limitations in understanding human annotations on text-heavy images. To address this, we propose the Human Annotation Understanding and Recognition (HAUR) task. As part of this effort, we introduce the Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which encompasses five common types of human annotations. Additionally, we developed and trained our model, OCR-Mix. Through comprehensive cross-model comparisons, our results demonstrate that OCR-Mix outperforms other models on this task. Our dataset and model will be released soon.

Paper Structure

This paper contains 28 sections, 5 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: A real-life scenario to explain the motivation behind our task.
  • Figure 2: The architecture of our OCR-Mix Model.
  • Figure 3: A simple image showcasing the contents of our HAUR-5 dataset: The dataset includes five common types of human annotations.
  • Figure 4: Accuracy (Acc) with different fusion module layers.
  • Figure 5: Experiment with human-labeled images in real-world scenarios: Red text represents the difference between the model’s prediction and the actual value.
  • ...and 9 more figures