mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning

Jingxuan Wei; Nan Xu; Guiyong Chang; Yin Luo; BiHui Yu; Ruifeng Guo

TL;DR

mChartQA addresses multimodal chart question answering by integrating vision-language alignment with structured reasoning through a four-component architecture (Vision Encoder, Connector, Chart-to-Text Engine, and LLM). It uses a two-stage training regime (Stage 1 for visual-language alignment, Stage 2 for visual-language reasoning), augmented by a DePlot-based chart-to-text initialization. Evaluations on ChartQA, FigureQA, and PlotQA show strong performance: the Intern-LM2 variant delivers the best overall results, and ablations underscore the importance of DePlot and the cross-attention connector. The work highlights the value of preserving visual chart details while enabling deep reasoning, and it points toward universal multimodal chart QA.
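To make the idea of a cross-attention connector concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is illustrative only and is not the paper's implementation: the assumption that text-side vectors act as queries over visual patch features, the omission of learned projection matrices, and all names (`cross_attention`, `text_tokens`, `patch_features`) and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of text-side vectors (assumed role, not from the paper)
    keys, values: lists of visual feature vectors (assumed role)
    Returns (outputs, weights): one fused vector and one weight row per query.
    Learned projections are omitted to keep the sketch short.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity between this query and every visual feature.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(wj * v[i] for wj, v in zip(w, values))
            for i in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy data: 2 text tokens attending over 3 image patches, all 4-dimensional.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
patch_features = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
fused, attn = cross_attention(text_tokens, patch_features, patch_features)
```

In this toy setup each text token places its largest attention weight on the patch whose features align with it, which is the mechanism that lets a connector pass chart-specific visual detail to the language model.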

Abstract

In the fields of computer vision and natural language processing, multimodal chart question-answering, especially involving color, structure, and textless charts, poses significant challenges. Traditional methods, which typically involve either direct multimodal processing or a table-to-text conversion followed by language model analysis, have limitations in effectively handling these complex scenarios. This paper introduces a novel multimodal chart question-answering model, specifically designed to address these intricate tasks. Our model integrates visual and linguistic processing, overcoming the constraints of existing methods. We adopt a dual-phase training approach: the initial phase focuses on aligning image and text representations, while the subsequent phase concentrates on optimizing the model's interpretative and analytical abilities in chart-related queries. This approach has demonstrated superior performance on multiple public datasets, particularly in handling color, structure, and textless chart questions, indicating its effectiveness in complex multimodal tasks.

Paper Structure (15 sections, 3 equations, 6 figures, 7 tables)

Introduction
Related Work
Method
Architecture
Training
Experiment
Datasets
Baselines
Experimental Setting
Main Results
Ablation Study
Further Analysis
Case Study
Error Analysis
Conclusion

Figures (6)

Figure 1: Examples of Color, Structure, and Textless Charts
Figure 2: The training architecture and workflow of the mChartQA model.
Figure 3: Format for Stage 1 training data includes Captioning, Grounding, and Chart-to-Text tasks. The prefix sequence is in black text, while the correct label is in red text.
Figure 4: Example test dataset extracted from the ChartQA, PlotQA, and FigureQA datasets, with the test set and example type displayed in green.
Figure 5: Case study example.
...and 1 more figure
