SynChart: Synthesizing Charts from Language Models

Mengchen Liu, Qixiu Li, Dongdong Chen, Dong Chen, Jianmin Bao, Yunsheng Li

TL;DR

This work tackles chart understanding by building SynChart, a large-scale synthetic chart dataset generated entirely from LLMs. A three-stage data-generation pipeline produces diverse data tables, chart code, and QA pairs, yielding ~4 million chart images with ~75 million dense annotations. A 4.2B chart-expert model trained on SynChart achieves near-GPT-4O performance on the ChartQA benchmark and exceeds GPT-4V, underscoring the value of synthetic data for specialized multi-modality tasks. The study also reports strong scaling properties and ablation results, suggesting that high-quality synthetic datasets can rival or surpass large public-domain data sources for domain-specific chart understanding.

Abstract

With the release of GPT-4V(O), its use in generating pseudo labels for multi-modality tasks has gained significant popularity. However, it is still a secret how to build such advanced models from their base large language models (LLMs). This work explores the potential of using LLMs alone for data generation to develop competitive multi-modality models, focusing on chart understanding. We construct a large-scale chart dataset, SynChart, which contains approximately 4 million diverse chart images with over 75 million dense annotations, including data tables, code, descriptions, and question-answer sets. We trained a 4.2B chart-expert model using this dataset and achieved near-GPT-4O performance on the ChartQA task, surpassing GPT-4V.

Paper Structure (25 sections, 4 figures, 7 tables)


Figures (4)

  • Figure 1: ChartQA accuracy highlighting different component contributions.
  • Figure 2: Data generation pipeline.
  • Figure 3: Chart image sample used in pretraining.
  • Figure 4: Chart image sample from Obelics.