Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement

Zhihang Yi, Jian Zhao, Jiancheng Lv, Tao Wang

TL;DR

This survey addresses the challenge of chart understanding, a multimodal information fusion task requiring the integration of graphical and textual cues. It traces the evolution from traditional CV-based chart analysis to modern Multimodal Large Language Models (MLLMs), emphasizing how fusion strategies and prompting enable richer reasoning over charts. The authors introduce an information-fusion-centric taxonomy, distinguishing canonical and non-canonical charts, and catalog downstream tasks and datasets to map current capabilities and gaps. They highlight limitations in perceptual fidelity and cognitive alignment, and advocate promising directions including intermediate representations, dynamic visual reasoning, and reinforcement learning with multi-agent debate to improve robustness and verifiability. Overall, the work provides a structured framework to guide future development of robust, trustworthy chart-understanding systems that combine precise perception with advanced reasoning.

Abstract

Chart understanding is a quintessential information fusion task, requiring the seamless integration of graphical and textual data to extract meaning. The advent of Multimodal Large Language Models (MLLMs) has revolutionized this domain, yet the landscape of MLLM-based chart analysis remains fragmented and lacks systematic organization. This survey provides a comprehensive roadmap of this nascent frontier by structuring the domain's core components. We begin by analyzing the fundamental challenges of fusing visual and linguistic information in charts. We then categorize downstream tasks and datasets, introducing a novel taxonomy of canonical and non-canonical benchmarks to highlight the field's expanding scope. Subsequently, we present a comprehensive evolution of methodologies, tracing the progression from classic deep learning techniques to state-of-the-art MLLM paradigms that leverage sophisticated fusion strategies. By critically examining the limitations of current models, particularly their perceptual and reasoning deficits, we identify promising future directions, including advanced alignment techniques and reinforcement learning for cognitive enhancement. This survey aims to equip researchers and practitioners with a structured understanding of how MLLMs are transforming chart information fusion and to catalyze progress toward more robust and reliable systems.

Paper Structure (44 sections, 18 figures, 6 tables)

This paper contains 44 sections, 18 figures, 6 tables.

Figures (18)

  • Figure 1: The prevalence of charts. Tools such as Bloomberg Terminal and AlphaSense can be used to analyze financial charts. Charts can be converted into voice descriptions through Apple VoiceOver and Microsoft Narrator, or into tactile charts [Moured_20242025-tactile-vega-lite], for the benefit of visually impaired users. An automated chart-understanding agent can monitor electrocardiogram (ECG) signals around the clock, reducing the need for intensive human supervision. Chart-understanding models can be packaged as plugins and integrated into scientific literature readers to help readers interpret scientific charts; alphaXiv, Scholarcy, SciSpace, and Scite have already made efforts in this direction. Chart-understanding methods can also be integrated into software such as Tableau and Gapminder to support business and public-policy decision-making.
  • Figure 2: The arrangement of the survey.
  • Figure 3: Downstream tasks of chart understanding. The tasks are divided into simple and advanced categories, each illustrated with an example. Simple tasks often rely on softmax classifiers or recurrent neural networks (RNNs) to generate outputs, while advanced tasks use more powerful Transformer-based autoregressive language models.
  • Figure 4: Examples of canonical and non-canonical charts. Canonical charts include line charts, bar charts, histograms, pie charts, scatter plots, and area charts. Non-canonical charts encompass a wide range of chart types, including but not limited to those illustrated in the figure.
  • Figure 5: The evolution from traditional machine learning approaches to MLLM-based methods. During the 1990s and 2000s, traditional methods such as Histogram of Oriented Gradients (HOG) [article], Local Binary Patterns (LBP) [10.5555/2102160.2102213], Template Matching [953947], Chart Grammar [52080910.1145/108360.108361], Hough Transform [10.1145/1284420.1284427], Edge Detection [inproceedings], and Decision Tree [inproceedings1] laid a solid foundation for chart understanding. After the rise of deep learning, methods shifted from handcrafted features to end-to-end learning, in which Convolutional Neural Networks (CNNs) [krizhevsky2012imagenet], R-CNN [girshick2014richfeaturehierarchiesaccurate], and Faster R-CNN [ren2016fasterrcnnrealtimeobject] automatically extracted features from raw image pixels. This phase saw the introduction of systems designed specifically for chart tasks, including ReVision [10.1145/2047196.2047247], VIEW [gao2012view], RARE [shi2016robust], CRNN [shi2016end], and ChartSense [10.1145/3025453.3025957]. Dedicated models such as SAN-VQA and IMG-QUES [kafle2018dvqa] were developed to answer simple questions about charts. Later approaches such as LayoutLM [xu2020layoutlm] began integrating visual and textual features more closely. The current phase of chart understanding leverages the strong vision-language alignment and reasoning capabilities of MLLMs. General-purpose models are foundation models that handle chart understanding as part of their general multimodal skills. Chart-specific models are fine-tuned on chart data to mitigate common issues of general MLLMs, such as hallucination. The figure also illustrates a future application: the integration of MLLMs into scientific reading environments to facilitate automated literature review.
  • ...and 13 more figures