Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback

Fatemeh Pesaran Zadeh; Juyeon Kim; Jin-Hwa Kim; Gunhee Kim

TL;DR

The paper addresses the challenge of generating diverse, accurate charts from natural language and data by identifying gaps in existing chart-generation datasets and training paradigms. It introduces Text2Chart31, a dataset built with a hierarchical data-generation pipeline, together with an RL-based instruction-tuning framework that uses automatic feedback via preference and alignment rewards. Across three tasks (Description-to-Chart, Raw Data-to-Chart, and Code-to-Description), the approach yields state-of-the-art performance among open-source models and results competitive with proprietary systems, particularly on underrepresented chart types. The work provides a scalable dataset and a practical RL-based training recipe that improves LLM-driven data visualization, with implications for broader multi-modal instruction-tuning and chart-automation workflows.

Abstract

Large language models (LLMs) have demonstrated strong capabilities across various language tasks, notably through instruction-tuning methods. However, LLMs face challenges in visualizing complex, real-world data through charts and plots. Firstly, existing datasets rarely cover a full range of chart types, such as 3D, volumetric, and gridded charts. Secondly, supervised fine-tuning methods do not fully leverage the intricate relationships within rich datasets, including text, code, and figures. To address these challenges, we propose a hierarchical pipeline and a new dataset for chart generation. Our dataset, Text2Chart31, includes 31 unique plot types referring to the Matplotlib library, with 11.1K tuples of descriptions, code, data tables, and plots. Moreover, we introduce a reinforcement learning-based instruction tuning technique for chart generation tasks without requiring human feedback. Our experiments show that this approach significantly enhances the model performance, enabling smaller models to outperform larger open-source models and be comparable to state-of-the-art proprietary models in data visualization tasks. We make the code and dataset available at https://github.com/fatemehpesaran310/Text2Chart31.

Paper Structure (27 sections, 2 equations, 20 figures, 8 tables)

Introduction
Text2Chart31 Dataset
Task Definition
Dataset Construction Pipeline
Analysis of Text2Chart31 Dataset
Instruction Tuning Approach
Supervised Fine-tuning
RL via Automatic Feedback
Experiments
Results of Description-to-Chart
Results of Raw Data-to-Chart
Results of Code-to-Description
Related Work
Conclusion
Details of Text2Chart31 Dataset
...and 12 more sections

Figures (20)

Figure 1: Illustration of the contributions of our method. (a): Existing datasets rarely cover a full range of chart types and primarily focus on QA tasks rather than chart generation. (b): Our dataset focuses on chart generation tasks and covers 31 unique plot types with tuples that combine descriptions, code, data tables, intermediate reasoning steps, and plots. (c): We further adopt an RL-based instruction tuning method that leverages automated feedback and cycle consistency.
Figure 2: Illustration of our hierarchical chart generation process with an example of a single plot type. The process begins by randomly selecting a topic from a topic pool. Two instructional samples are then chosen from an instruction pool and given to GPT-3.5-turbo to generate a new instruction, which undergoes a self-evaluation process by GPT-4 for qualification. If it meets the criteria, which include compatibility with the data points and the plot type, it is added to the instruction pool. Simultaneously, the new instruction is sent to GPT-4 for data table creation using a long data table format and code generation. Finally, the generated tuple ($x, d, c, y$) passes through a final cycle-consistency filter that validates the quality and correctness of the produced data point.
Figure 3: Human evaluation results on a randomly sampled subset of the test set. We compare SFT+$\text{RL}_\text{pref}$ L3I-8B and SFT CLI-13B with GPT-3.5-turbo and L3I-8B.
Figure 4: Examples from the 31 plot types in Text2Chart31 dataset, grouped into 5 chart categories.
Figure 5: Distribution of keywords within the topic pool, showcasing the diverse and balanced coverage of topics in the Text2Chart31 dataset.
...and 15 more figures
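
The generation loop described in the Figure 2 caption can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: every function and parameter name here is hypothetical, the model calls (GPT-3.5-turbo for proposing, GPT-4 for self-evaluation and for table and code generation) are passed in as plain callables, and the mapping of ($x, d, c, y$) to description, data table, code, and plot is an assumption based on the abstract's list of tuple contents. The sketch also assumes that only instructions passing self-evaluation proceed to table and code generation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class ChartTuple:
    description: str  # x (assumed mapping)
    data_table: str   # d (assumed mapping)
    code: str         # c (assumed mapping)
    plot: bytes       # y (assumed mapping): the rendered figure


def generate_sample(
    topic_pool: Sequence[str],
    instruction_pool: List[str],
    propose: Callable[[str, List[str]], str],            # hypothetical: proposer model call
    qualifies: Callable[[str], bool],                    # hypothetical: self-evaluation call
    make_table_and_code: Callable[[str], Tuple[str, str]],  # hypothetical: table + code call
    render: Callable[[str, str], bytes],                 # hypothetical: runs code on the table
    cycle_consistent: Callable[[ChartTuple], bool],      # hypothetical: final filter
    rng: random.Random,
) -> Optional[ChartTuple]:
    """Run one pass of the pipeline; return None if the sample is rejected."""
    topic = rng.choice(topic_pool)               # randomly select a topic
    seeds = rng.sample(instruction_pool, 2)      # two instructional samples
    instruction = propose(topic, seeds)          # generate a new instruction
    if not qualifies(instruction):               # self-evaluation for qualification
        return None
    instruction_pool.append(instruction)         # qualified: add to the pool
    table, code = make_table_and_code(instruction)
    sample = ChartTuple(instruction, table, code, render(code, table))
    # Final cycle-consistency filtering of the whole tuple.
    return sample if cycle_consistent(sample) else None
```

Passing the model calls in as arguments keeps the sketch runnable with simple stand-ins, which is convenient for checking the control flow without any API access.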
