Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning

Xingchen Zeng, Haichuan Lin, Yilin Ye, Wei Zeng

TL;DR

This paper identifies two critical gaps in chart question answering (CQA) with multimodal LLMs: distribution biases in ChartQA and insufficient adaptation to chart-specific visual encodings. It introduces a data engine that combines data filtering with visualization-referenced data generation to build a broader, higher-quality dataset and benchmark, alongside a model that unfreezes the vision encoder and uses a mixture-of-resolution strategy for fine-grained chart recognition. The approach achieves state-of-the-art or competitive results with substantially less training data than existing methods, demonstrating the value of reasoning-focused visual instructions and chart-aware data design. The work also releases a large, balanced benchmark together with code and data, enabling further advances in practical CQA.

Abstract

Emerging multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA). Recent efforts primarily focus on scaling up training datasets (i.e., charts, data tables, and question-answer (QA) pairs) through data collection and synthesis. However, our empirical study on existing MLLMs and CQA datasets reveals notable gaps. First, current data collection and synthesis focus on data volume and lack consideration of fine-grained visual encodings and QA tasks, resulting in an unbalanced data distribution that diverges from practical CQA scenarios. Second, existing work follows the training recipe of the base MLLMs initially designed for natural images, under-exploring the adaptation to unique chart characteristics, such as rich text elements. To fill these gaps, we propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development. Specifically, we propose a novel data engine to effectively filter diverse and high-quality data from existing datasets and subsequently refine and augment the data using LLM-based generation techniques to better align with practical QA tasks and visual encodings. Then, to facilitate the adaptation to chart characteristics, we utilize the enriched data to train an MLLM by unfreezing the vision encoder and incorporating a mixture-of-resolution adaptation strategy for enhanced fine-grained recognition. Experimental results validate the effectiveness of our approach. Even with fewer training examples, our model consistently outperforms state-of-the-art CQA models on established benchmarks. We also contribute a dataset split as a benchmark for future research. Source code and datasets of this paper are available at https://github.com/zengxingchen/ChartQA-MLLM.

Paper Structure (31 sections, 2 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: A typical MLLM architecture, consisting of an image encoder, a projector, and an LLM. The figure also marks the concatenation of image tokens $\mathbf{X}_{\text{img}}$ and text tokens $\mathbf{X}_{\text{text}}$.
  • Figure 2: Comparison of chart distribution between ChartQA (Masry et al., 2022) and Beagle (Battle et al., 2018).
  • Figure 3: Three typical categories of failure: recognition error, numerical error during inference, and knowledge error during inference. Incorrect reasoning steps are colored in red, while correct steps are colored in green.
  • Figure 4: Illustration of the data filtering process, encompassing classification and sampling. Classification examines the distribution of existing datasets across key categorical attributes, including chart type, layout, trend, number annotations, and data grouping. Sampling is then performed over the resulting fine-grained categories.
  • Figure 5: Data generation pipeline. First, we conduct retrieval-augmented chart generation with a set of table-code pairs we collected. This results in a collection of images distributed evenly in the real-world chart space. Then, we conduct visualization-referenced encoding augmentation for each seed chart to further enrich the dataset's size and diversity.
  • ...and 1 more figure
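
The classify-then-sample idea described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the attribute names (`chart_type`, `layout`), the per-group cap, and the helper `stratified_sample` are all hypothetical, and the classification step is assumed to have already attached categorical labels to each chart.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(charts, key_fields, per_group, seed=0):
    """Group charts by a tuple of categorical attributes (the "fine category"),
    then draw at most `per_group` charts from each group.

    Assumption: a fixed cap per group is used here only for illustration;
    the paper's actual sampling rule may differ.
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    groups = defaultdict(list)
    for chart in charts:
        fine_category = tuple(chart[field] for field in key_fields)
        groups[fine_category].append(chart)

    sampled = []
    for fine_category in sorted(groups):  # deterministic group order
        members = groups[fine_category]
        k = min(per_group, len(members))  # small groups are kept whole
        sampled.extend(rng.sample(members, k))
    return sampled


# Toy, hand-made metadata: heavily skewed toward bar charts.
charts = (
    [{"id": f"bar-{i}", "chart_type": "bar", "layout": "vertical"} for i in range(8)]
    + [{"id": f"line-{i}", "chart_type": "line", "layout": "single"} for i in range(3)]
    + [{"id": "pie-0", "chart_type": "pie", "layout": "single"}]
)

balanced = stratified_sample(charts, ["chart_type", "layout"], per_group=2)
type_counts = Counter(chart["chart_type"] for chart in balanced)
```

Capping each fine category flattens the skew of the source pool (8 bar charts shrink to 2) while rare categories such as the single pie chart are retained in full, which is the balancing effect the caption describes.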