mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning

Jingxuan Wei, Nan Xu, Guiyong Chang, Yin Luo, BiHui Yu, Ruifeng Guo

TL;DR

mChartQA addresses multimodal chart question answering by integrating vision-language alignment with structured reasoning through a four-component architecture (Vision Encoder, Connector, Chart-to-Text Engine, and LLM). It employs a two-stage training regime (Stage 1 for visual-language alignment, Stage 2 for visual-language reasoning), augmented by a DePlot-based chart-to-text initialization. Evaluations on ChartQA, FigureQA, and PlotQA demonstrate strong performance, with the Intern-LM2 variant delivering the strongest overall results and ablations underscoring the importance of DePlot and the cross-attention connector. The work highlights the value of preserving visual chart details while enabling deep reasoning, offering a scalable benchmark and a path toward universal multimodal chart QA.

Abstract

In the fields of computer vision and natural language processing, multimodal chart question-answering, especially involving color, structure, and textless charts, poses significant challenges. Traditional methods, which typically involve either direct multimodal processing or a table-to-text conversion followed by language model analysis, have limitations in effectively handling these complex scenarios. This paper introduces a novel multimodal chart question-answering model, specifically designed to address these intricate tasks. Our model integrates visual and linguistic processing, overcoming the constraints of existing methods. We adopt a dual-phase training approach: the initial phase focuses on aligning image and text representations, while the subsequent phase concentrates on optimizing the model's interpretative and analytical abilities in chart-related queries. This approach has demonstrated superior performance on multiple public datasets, particularly in handling color, structure, and textless chart questions, indicating its effectiveness in complex multimodal tasks.

Paper Structure (15 sections, 3 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Examples of Color, Structure, and Textless Charts
  • Figure 2: The training architecture and workflow of the mChartQA model.
  • Figure 3: Format for Stage 1 training data includes Captioning, Grounding, and Chart-to-Text tasks. The prefix sequence is in black text, while the correct label is in red text.
  • Figure 4: Example test dataset extracted from the ChartQA, PlotQA, and FigureQA datasets, with the test set and example type displayed in green.
  • Figure 5: Case study example.
  • ...and 1 more figures