ChartAdapter: Large Vision-Language Model for Chart Summarization

Peixin Xu; Yujuan Ding; Wenqi Fan

TL;DR

ChartAdapter introduces a cross-modal bridge between chart encoders and large language models, using learnable queries and a cross-modal projector to extract implicit chart semantics. The architecture couples four components (cross-modal projector, latent textual embeddings, cross-modal interaction layer, and implicit semantic decoder layer) within a transformer-based framework, and translates visual chart features into textual summaries via an end-to-end LLM pipeline. A three-stage training strategy, together with a new ChartSumm dataset of $190{,}618$ samples, enables robust vision-to-language alignment and high-quality chart summaries, achieving state-of-the-art results on the Chart-to-Text Pew benchmark (e.g., BLEU-4 $=35.55$, ROUGE-L $=25.79$). Ablation studies confirm the importance of each component and training stage, underscoring the value of chart-specific alignment in LVLM-based chart understanding and its potential for scalable, accurate data communication.
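For readers unfamiliar with the ROUGE-L score quoted above, the following is a minimal sketch of how such a score can be computed from the longest common subsequence (LCS) of a generated summary and a reference. It assumes lowercased whitespace tokenization and the balanced F1 form of the measure; the function names `lcs_length` and `rouge_l_f1` are hypothetical, and the paper's actual evaluation script may tokenize and weight precision and recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Row-by-row dynamic programming; prev holds the previous DP row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as an F1 score over LCS-based precision and recall.

    Assumption: whitespace tokenization and equal weighting of
    precision and recall; this is an illustration, not the paper's scorer.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical summaries, used only to exercise the function.
score = rouge_l_f1("sales rose in 2020", "sales rose sharply in 2020")
print(f"ROUGE-L F1: {score:.4f}")
```

Benchmark tables typically report this value multiplied by 100, which is the scale of the figures quoted in the TL;DR.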

Abstract

Chart summarization, which focuses on extracting key information from charts and interpreting it in natural language, is crucial for generating and delivering insights through effective and accessible data analysis. Traditional methods for chart understanding and summarization often rely on multi-stage pipelines, which may produce suboptimal semantic alignment between visual and textual information. In comparison, recently developed LLM-based methods depend heavily on the capabilities of general-purpose foundation vision and language models, while ignoring the characteristics of chart data and the challenges specific to it. To address these limitations, we propose ChartAdapter, a novel lightweight transformer module designed to bridge the gap between charts and textual summaries. ChartAdapter employs learnable query vectors to extract implicit semantics from chart data and incorporates a cross-modal alignment projector to enhance vision-to-language generative learning. By integrating ChartAdapter with an LLM, we enable end-to-end training and efficient chart summarization. To further enhance training, we introduce a three-stage hierarchical training procedure and develop a large-scale dataset specifically curated for chart summarization, comprising 190,618 samples. Experimental results on the standard Chart-to-Text test set demonstrate that our approach significantly outperforms existing methods, including state-of-the-art models, in generating high-quality chart summaries. Ablation studies further validate the effectiveness of the key components of ChartAdapter. This work highlights the potential of tailored LLM-based approaches to advance chart understanding and lays a strong foundation for future research in this area.
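The idea of learnable query vectors reading from projected chart features can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single linear projector followed by one single-head scaled dot-product cross-attention step, uses random untrained weights, and every name and dimension here (`QueryAdapterSketch`, `d_vision`, `d_model`, `num_queries`) is hypothetical. The real module also includes an implicit semantic decoder layer that is omitted here.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


class QueryAdapterSketch:
    """Illustrative adapter: projector plus query-based cross-attention."""

    def __init__(self, d_vision, d_model, num_queries, seed=0):
        rng = random.Random(seed)
        self.d_model = d_model
        # Assumed cross-modal projector: maps chart features to the text width.
        self.proj = rand_matrix(d_vision, d_model, rng)
        # Assumed learnable latent queries (trained in practice, random here).
        self.queries = rand_matrix(num_queries, d_model, rng)
        self.w_q = rand_matrix(d_model, d_model, rng)
        self.w_k = rand_matrix(d_model, d_model, rng)
        self.w_v = rand_matrix(d_model, d_model, rng)

    def attention_weights(self, chart_feats):
        """Return (weights, projected): one weight row per query over patches."""
        projected = matmul(chart_feats, self.proj)
        q = matmul(self.queries, self.w_q)
        k = matmul(projected, self.w_k)
        scores = matmul(q, transpose(k))
        scale = math.sqrt(self.d_model)
        weights = [softmax([s / scale for s in row]) for row in scores]
        return weights, projected

    def forward(self, chart_feats):
        """Return num_queries vectors of width d_model, ready for an LLM."""
        weights, projected = self.attention_weights(chart_feats)
        v = matmul(projected, self.w_v)
        return matmul(weights, v)


# Hypothetical input: 6 chart patches with 8-dimensional encoder features.
chart_feats = rand_matrix(6, 8, random.Random(1))
adapter = QueryAdapterSketch(d_vision=8, d_model=4, num_queries=3)
tokens = adapter.forward(chart_feats)
print(len(tokens), len(tokens[0]))
```

One consequence of this design is that the number of vectors handed to the LLM is fixed by the number of queries, not by the number of chart patches, so the adapter's output length stays constant regardless of chart resolution.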

Paper Structure (18 sections, 6 equations, 2 figures, 4 tables)

This paper contains 18 sections, 6 equations, 2 figures, 4 tables.

Introduction
Related Works
Methodology
ChartAdapter
Cross-Modal Projector
Latent Textual Embeddings
Cross-Modal Interaction Layer
Implicit Semantic Decoder Layer
Training Strategy
Experiments
Experimental Settings
Dataset
Baselines
Model settings
Evaluation settings
...and 3 more sections

Figures (2)

Figure 1: The overall framework of the proposed ChartAdapter integrated in a Large Vision-Language Model (LVLM). It acts as a bridge between the chart encoder and LLM through transformer-based cross-modal interaction modeling with learnable latent embeddings to extract implicit semantics relevant to charts, complemented by a cross-modal alignment projector.
Figure 2: Chart Summarization Sample Generated by ChartAdapter.
