InfoDet: A Dataset for Infographic Element Detection

Jiangning Zhu, Yuxing Zhou, Zheng Wang, Juntao Yao, Yima Gu, Yuhui Yuan, Shixia Liu

TL;DR

InfoDet introduces a large-scale infographic element detection dataset with 101,264 infographics (11,264 real and 90,000 synthetic) and 14.2M bounding-box annotations, spanning texts, charts, human-recognizable objects (HROs), and sub-elements. It combines programmatic annotation for synthetic data and model-in-the-loop annotation for real data to bootstrap a high-quality detector (InternImage-based) used across tasks. Three applications (Thinking-with-Boxes for grounded chart reasoning, comprehensive detector benchmarking, and cross-domain graphic layout detection) demonstrate its utility and generalizability. The results show that fine-tuning traditional detectors on InfoDet yields strong performance and better generalization to related domains, highlighting the dataset’s value for robust visual grounding in infographics. This work addresses a critical gap in chart understanding for vision-language models by providing rich, diverse, and scalable annotations tailored to infographic designs.

Abstract

Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce InfoDet, a dataset designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 11,264 real and 90,000 synthetic infographics, with over 14 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of InfoDet through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.

Paper Structure

This paper contains 51 sections, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Key contributions: 1) An open-source dataset InfoDet. 2) Improvements on chart understanding, infographic element detection, and graphic layout detection.
  • Figure 2: The construction pipeline for the InfoDet dataset.
  • Figure 3: The Thinking-with-Boxes scheme: (a) the charts, HROs, and texts are detected and overlaid onto the original image to create annotated images with grounded elements; (b) the input of the grounded chain-of-thought method (B$_1$) and its ablated variants (B$_2$, B$_3$, B$_4$).
  • Figure 4: Grounded CoT guides the model to think step-by-step and achieve the correct answer.
  • Figure 5: Detection results of evaluated object detection models: (a) zero-shot prompting with DINO-X; (b) 4-shot prompting with T-Rex2; (c) 4-shot fine-tuning with Co-DETR; (d) fine-tuning on InfoDet with Co-DETR. Bounding boxes in colors are the predictions for charts and HROs.
  • ...and 4 more figures