ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding

Zhengzhuo Xu, Bowen Qu, Yiyan Qi, Sinan Du, Chengjin Xu, Chun Yuan, Jian Guo

TL;DR

ChartMoE tackles faithful chart understanding by introducing a Mixture-of-Experts (MoE) connector between the chart vision encoder and the LLM, replacing the standard single projector with multiple specialized experts. Each expert is initialized from a distinct alignment task on the ChartMoE-Align dataset (~1M chart-table-JSON-code quadruples). Training proceeds in three stages: alignment pre-training, high-quality MMC knowledge learning, and chart-specific annealing with ChartQA/ChartGemma, with the gate network, experts, and LoRA weights trained during supervised fine-tuning. Empirically, ChartMoE delivers state-of-the-art ChartQA performance (84.64% relaxed accuracy at a 0.05 tolerance) and strong results on ChartBench, ChartFC, and ChartCheck, and ablations confirm the benefit of diverse expert initialization and staged training. The approach preserves general task capabilities, adds minimal computational overhead, and yields interpretable expert specialization over chart elements, enabling precise value extraction, chart editing, and code-based interactions.
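
The "@0.05" figure refers to ChartQA-style relaxed accuracy, where a numeric answer counts as correct if it lies within 5% relative error of the target and other answers need an exact match. The following is a minimal sketch of that check, not the authors' evaluation script; the function name `relaxed_match`, the case-insensitive string comparison, and the handling of a zero target are assumptions made for illustration.

```python
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Return True if prediction matches target under a relative tolerance.

    Hypothetical helper: numeric answers are compared by relative error,
    everything else by case-insensitive exact match (an assumption).
    """

    def to_float(text: str):
        # Return the parsed number, or None if the text is not numeric.
        try:
            return float(text.strip())
        except ValueError:
            return None

    p, t = to_float(prediction), to_float(target)
    if p is not None and t is not None:
        if t == 0:
            # Relative error is undefined for a zero target; require equality.
            return p == 0
        return abs(p - t) / abs(t) <= tolerance
    return prediction.strip().lower() == target.strip().lower()


def relaxed_accuracy(predictions, targets, tolerance: float = 0.05) -> float:
    """Fraction of (prediction, target) pairs that pass relaxed_match."""
    pairs = list(zip(predictions, targets))
    hits = sum(relaxed_match(p, t, tolerance) for p, t in pairs)
    return hits / len(pairs)
```

Under this sketch, a prediction of "104" against a target of "100" passes (4% relative error), while "106" does not.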

Abstract

Automatic chart understanding is crucial for content comprehension and document parsing. Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in chart understanding through domain-specific alignment and fine-tuning. However, current MLLMs still struggle to provide faithful data and reliable analysis based solely on charts. To address this, we propose ChartMoE, which employs the Mixture of Experts (MoE) architecture in place of the traditional linear projector to bridge the modality gap. Specifically, we train several linear connectors through distinct alignment tasks and use them as the initialization parameters for different experts. Additionally, we introduce ChartMoE-Align, a dataset with nearly 1 million chart-table-JSON-code quadruples, to conduct three alignment tasks (chart-table, chart-JSON, and chart-code). Combined with the vanilla connector, we initialize the experts diversely and adopt high-quality knowledge learning to further refine the MoE connector and LLM parameters. Extensive experiments demonstrate the effectiveness of the MoE connector and our initialization strategy; for example, ChartMoE improves the accuracy of the previous state-of-the-art from 80.48% to 84.64% on the ChartQA benchmark.
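
The connector described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' implementation: each expert is a single linear map, routing is top-2 per visual token (the figure list mentions a top-2 expert distribution), and the four "aligned" weight matrices are random stand-ins for connectors that would really come from the vanilla, chart-table, chart-JSON, and chart-code alignment runs. The class name `MoEConnector` and all dimensions are hypothetical.

```python
import numpy as np


class MoEConnector:
    """Toy MoE connector: a gate routes each visual token to its top-k experts."""

    def __init__(self, expert_weights, gate_weight, top_k=2):
        # Each expert is a (d_in, d_out) linear map, assumed to be
        # initialized from a separately aligned connector.
        self.experts = [np.asarray(w, dtype=float) for w in expert_weights]
        # The gate maps each token to one logit per expert: (d_in, n_experts).
        self.gate = np.asarray(gate_weight, dtype=float)
        self.top_k = top_k

    def __call__(self, tokens):
        logits = tokens @ self.gate                          # (n_tokens, n_experts)
        top = np.argsort(-logits, axis=1)[:, : self.top_k]   # chosen expert ids
        top_logits = np.take_along_axis(logits, top, axis=1)
        # Softmax over the selected experts only, so weights sum to 1 per token.
        weights = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)

        out = np.zeros((tokens.shape[0], self.experts[0].shape[1]))
        for slot in range(self.top_k):
            for expert_id, w in enumerate(self.experts):
                mask = top[:, slot] == expert_id
                if mask.any():
                    out[mask] += weights[mask, slot][:, None] * (tokens[mask] @ w)
        return out, top


rng = np.random.default_rng(0)
d_in, d_out = 8, 16
names = ["vanilla", "table", "json", "code"]
# Random stand-ins for the four separately aligned connectors (assumption).
aligned = {name: rng.normal(scale=0.1, size=(d_in, d_out)) for name in names}
gate = rng.normal(size=(d_in, len(names)))

connector = MoEConnector([aligned[n] for n in names], gate, top_k=2)
tokens = rng.normal(size=(5, d_in))   # five visual tokens
out, top = connector(tokens)
```

Here `out` holds the projected tokens and `top` records which two experts each token was routed to, which is the kind of per-token selection the expert-distribution figures visualize. Starting each expert from a different aligned connector, rather than from copies of one connector, is the diversity the abstract argues for.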

Paper Structure (34 sections, 3 equations, 16 figures, 11 tables)

Figures (16)

  • Figure 1: Overview and capabilities of ChartMoE: We introduce a MoE architecture connector and provide visualizations of the top-1 expert selection (refer to Fig. \ref{fig_token_wise_expert} and Appendix \ref{apdx_sec_visual_token} for details). ChartMoE can extract highly precise values and provide flexible chart editing through code-based interactions.
  • Figure 2: Overview of proposed ChartMoE. (a) Examples of alignment instructions. (b) We conduct three different alignment tasks in parallel. (c) We initialize MoE connectors in four different manners and train the gate network, experts, and LoRA during the supervised fine-tuning stage.
  • Figure 3: Overview of the ChartMoE-Align data generation pipeline. The charts are plotted with Python's matplotlib.
  • Figure 4: Training loss under different initializations.
  • Figure 5: Top-2 selected expert distribution on ChartBench.
  • ...and 11 more figures