mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning
Jingxuan Wei, Nan Xu, Guiyong Chang, Yin Luo, BiHui Yu, Ruifeng Guo
TL;DR
mChartQA addresses multimodal chart question answering by integrating vision-language alignment with structured reasoning in a four-component architecture: a Vision Encoder, a Connector, a Chart-to-Text Engine, and an LLM. Training proceeds in two stages (Stage 1 for visual-language alignment, Stage 2 for visual-language reasoning) and is augmented by a DePlot-based chart-to-text initialization. Evaluations on ChartQA, FigureQA, and PlotQA show strong performance, with the Intern-LM2 variant delivering the strongest overall results; ablations underscore the importance of DePlot and the cross-attention connector. The work highlights the value of preserving visual chart details while enabling deep reasoning, offering a scalable benchmark and a path toward universal multimodal chart QA.
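The data flow between the four components and the order of the two training stages can be illustrated with a minimal sketch. It is not the authors' implementation: every class, function, and field name below is a hypothetical placeholder, the components are trivial stand-ins rather than real models, and the exact wiring (for example, that the Connector also receives the question and that the chart-to-text output is passed to the LLM as prompt text) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# NOTE: all names here are hypothetical placeholders, not the authors' code.
Features = List[float]


@dataclass
class ChartQAPipeline:
    """Illustrative wiring of the four components named in the summary."""
    vision_encoder: Callable[[bytes], Features]     # chart image -> visual features
    connector: Callable[[Features, str], Features]  # assumed: aligns features with the question
    chart_to_text: Callable[[bytes], str]           # chart image -> table-like text (DePlot-style)
    llm: Callable[[Features, str], str]             # aligned features + text prompt -> answer

    def answer(self, image: bytes, question: str) -> str:
        visual = self.vision_encoder(image)
        aligned = self.connector(visual, question)
        table_text = self.chart_to_text(image)
        # Assumed prompt layout; the source does not specify one.
        prompt = f"Table: {table_text}\nQuestion: {question}"
        return self.llm(aligned, prompt)


# The two training stages named in the summary, in order.
TRAINING_STAGES = [
    ("stage_1", "visual-language alignment"),
    ("stage_2", "visual-language reasoning"),
]


def build_toy_pipeline() -> ChartQAPipeline:
    """Wire the components with trivial stand-ins so the data flow can be run."""
    return ChartQAPipeline(
        vision_encoder=lambda image: [float(b) for b in image],
        connector=lambda feats, question: [f * len(question) for f in feats],
        chart_to_text=lambda image: "col | value",
        llm=lambda feats, prompt: f"answer({len(feats)} features, {len(prompt)} prompt chars)",
    )


if __name__ == "__main__":
    pipeline = build_toy_pipeline()
    print(pipeline.answer(b"\x01\x02", "Which bar is tallest?"))
    for name, objective in TRAINING_STAGES:
        print(name, "->", objective)
```

The sketch keeps the visual path (encoder and connector) separate from the textual path (chart-to-text), which mirrors the summary's point that visual chart details are preserved alongside the extracted text rather than replaced by it.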
Abstract
In the fields of computer vision and natural language processing, multimodal chart question answering poses significant challenges, especially for questions involving color, structure, and textless charts. Traditional methods, which typically involve either direct multimodal processing or a table-to-text conversion followed by language model analysis, have limitations in effectively handling these complex scenarios. This paper introduces a novel multimodal chart question-answering model specifically designed to address these intricate tasks. Our model integrates visual and linguistic processing, overcoming the constraints of existing methods. We adopt a dual-phase training approach: the initial phase focuses on aligning image and text representations, while the subsequent phase concentrates on optimizing the model's interpretative and analytical abilities on chart-related queries. This approach has demonstrated superior performance on multiple public datasets, particularly on questions about color, structure, and textless charts, indicating its effectiveness in complex multimodal tasks.
