Graph-Based Multimodal Contrastive Learning for Chart Question Answering
Yue Dai, Soyeon Caren Han, Wei Liu
TL;DR
This work tackles ChartQA by introducing a joint multimodal scene-graph framework that explicitly models relationships among chart elements using both visual and textual graphs, and aligns their representations through graph contrastive learning. The unified multimodal graph embeddings are injected into a transformer decoder as soft prompts, enabling improved reasoning in chart-based questions. To address hallucinations in multimodal large language models, the authors propose Chain-of-Thought prompting with multi-format reasoning prompts and a follow-up prompt to produce direct answers. Extensive experiments on ChartQA, OpenCQA, and ChartX show significant performance gains, demonstrating the efficacy of graph-based multimodal fusion and CoT strategies for robust chart understanding and reasoning in zero-shot and few-shot contexts.
Abstract
Chart question answering (ChartQA) is challenged by the heterogeneous composition of chart elements and the subtle data patterns they encode. This work introduces a novel joint multimodal scene graph framework that explicitly models the relationships among chart components and their underlying structures. The framework integrates both visual and textual graphs to capture structural and semantic characteristics, while a graph contrastive learning strategy aligns node representations across modalities, enabling their seamless incorporation into a transformer decoder as soft prompts. Moreover, a set of tailored Chain-of-Thought (CoT) prompts is proposed to enhance multimodal large language models (MLLMs) in zero-shot scenarios by mitigating hallucinations. Extensive evaluations on benchmarks including ChartQA, OpenCQA, and ChartX demonstrate significant performance improvements and validate the efficacy of the proposed approach.
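
To make the cross-modal alignment idea concrete, the following is a minimal sketch of a symmetric InfoNCE-style contrastive loss over paired node embeddings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss form, so the InfoNCE objective, the pairing of visual and textual nodes by index, the temperature value, and all function names (`l2_normalize`, `cross_entropy_diag`, `symmetric_info_nce`) are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(visual_nodes, textual_nodes, temperature=0.1):
    """Symmetric InfoNCE loss over index-paired node embeddings.

    Assumption: visual_nodes[i] and textual_nodes[i] describe the same chart
    element (a positive pair); every other combination is a negative.
    """
    if len(visual_nodes) != len(textual_nodes) or not visual_nodes:
        raise ValueError("expected two non-empty, equally sized node lists")
    v = [l2_normalize(x) for x in visual_nodes]
    t = [l2_normalize(x) for x in textual_nodes]
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
           for vi in v]
    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_transposed))


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    textual_aligned = [[0.9, 0.1], [0.1, 0.9]]
    textual_swapped = [[0.1, 0.9], [0.9, 0.1]]
    print("aligned loss:", symmetric_info_nce(visual, textual_aligned))
    print("swapped loss:", symmetric_info_nce(visual, textual_swapped))
```

In this sketch the loss is small when each visual node is most similar to its own textual counterpart and grows when the pairing is scrambled, which is the behavior a cross-modal alignment objective is meant to encourage before the aligned embeddings are passed on as soft prompts.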
