From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji

TL;DR

This survey analyzes automatic chart understanding in the era of large foundation models, detailing the problem formulation, data ecosystems, and modeling paradigms that range from traditional classification to end-to-end generation and large vision-language models (LVLMs). It highlights the shift from OCR-dependent pipelines to OCR-free, pre-trained, and instruction-tuned architectures, and discusses tool-augmented strategies that bridge perception and reasoning. The work assesses state-of-the-art performance across tasks such as chart question answering, captioning, and chart-to-table conversion, while emphasizing the need for robust evaluation metrics that capture faithfulness, coverage, and fairness. It also identifies challenges such as domain-specific charts, multilingual understanding, and the lack of standardized evaluation frameworks, offering concrete directions for future research and practical deployment. Overall, the paper positions LVLMs and OCR-free chart understanding as central to scalable, accurate extraction of insights from visual data representations.

Abstract

Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding informed decision-making. Automatic chart understanding has advanced significantly in recent years with the rise of large foundation models. Foundation models, such as large language models, have revolutionized various natural language processing tasks and are increasingly being applied to chart understanding tasks. This survey provides a comprehensive overview of recent developments, challenges, and future directions in chart understanding within the context of these foundation models. We review the fundamental building blocks crucial for studying chart understanding tasks. Additionally, we explore the tasks themselves, their evaluation metrics, and the sources of both charts and textual inputs. Various modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance on each task and how it can be improved. Challenges and future directions are addressed, highlighting topics such as domain-specific charts, the lack of effort devoted to developing evaluation metrics, and agent-oriented settings. This survey serves as a comprehensive resource for researchers and practitioners in natural language processing, computer vision, and data analysis, providing insights and directions for future research in chart understanding with large foundation models. The studies mentioned in this paper, along with emerging research, will be continually updated at: https://github.com/khuangaf/Awesome-Chart-Understanding.

Paper Structure (34 sections, 12 equations, 3 figures, 6 tables)

This paper contains 34 sections, 12 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: An overview of different chart understanding tasks. Each task is illustrated with one example.
  • Figure 2: A taxonomy of chart understanding approaches with representative work.
  • Figure 3: A comparison between (small) pre-trained vision-language models and LVLMs. Beyond model scale, the biggest difference between these two types of models is that LVLMs do not need task-specific fine-tuning, since instruction tuning allows them to generalize to unseen tasks.