Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation

Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, Lin Ma

TL;DR

This work tackles chart-to-code generation, a challenging multimodal task requiring dense visual understanding and structured code output. It first demonstrates a plateau in performance when solely scaling supervised fine-tuning, then introduces Multimodal Structured Reinforcement Learning (MSRL), which combines textual rule-based rewards with a render-and-compare visual reward in a two-stage RL framework guided by GRPO. A large, real-world corpus of 3 million chart-code pairs is constructed, with a filtered 33k high-quality RL subset, enabling rigorous evaluation that surpasses open-source baselines and rivals proprietary systems on ChartMimic and ReachQA. The study provides a practical route to break the SFT plateau in complex multimodal code generation by leveraging multi-granularity feedback and staged optimization, with implications for other structured output tasks in vision-language models.

Abstract

While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring in-depth understanding of information-rich images and generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to generate structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies that appropriately reward structured outputs. We systematically investigate the performance plateau in SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation, which substantially breaks through this plateau. We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables to mitigate simplistic patterns of prior synthetic data. Despite reaching state-of-the-art performance, our experiments show that scaling SFT data eventually hits a plateau where further increases yield negligible improvements. Our MSRL method leverages a multi-granularity structured reward system using multimodal textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details. At the visual level, model-based rewards assess structural similarity by rendering generated code into images and employing an evaluator model. We implement this within a two-stage curriculum for training stability. Results demonstrate that MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively, achieving competitive performance with advanced closed-source models.
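
The reward design described in the abstract can be illustrated with a minimal Python sketch. It assumes both rewards are already scaled to [0, 1], blends them with a hypothetical `visual_weight`, and then computes group-relative advantages in the style of GRPO (named in the TL;DR), where each sampled completion's reward is normalized by the mean and standard deviation of its group. The function names, the linear weighting, and the example reward values are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(text_reward: float, visual_reward: float,
                    visual_weight: float = 0.5) -> float:
    """Blend a rule-based textual reward with a model-based visual reward.

    Assumption: a simple linear blend. The actual weighting used by MSRL
    is not specified in this summary.
    """
    return (1.0 - visual_weight) * text_reward + visual_weight * visual_reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by its group's mean and (population) std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up (textual, visual) rewards for four completions sampled for one chart.
samples = [(0.9, 0.8), (0.6, 0.7), (0.3, 0.2), (0.9, 0.4)]
rewards = [combined_reward(t, v) for t, v in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch, a staged schedule could start with `visual_weight=0.0` and raise it later. Whether the paper's two-stage curriculum works this way is not stated in this summary.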

Paper Structure

This paper contains 16 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The SFT plateau and RL performance gain from our experiments. The left panel shows that scaling SFT data from 200k to 2.8M leads to a performance plateau beyond 2M data points. The right panel shows the performance gain from our proposed MSRL training strategy.
  • Figure 2: The data generation pipeline and our proposed MSRL framework. (a) Our pipeline prompts Gemini-2.5-Flash with tables from arXiv papers and example code to generate plotting code. After execution and filtering, the chart-to-code dataset of 3M pairs is obtained. We then sample 130k of these pairs and apply a three-stage filter (chart, data, vision) to curate a final, high-quality dataset of 33k examples for RL. (b) The framework of our proposed MSRL strategy. The textual reward is derived from a rule-based evaluation of the generated code across five distinct dimensions. An MLLM quantifies the visual reward based on the rendered image.
  • Figure 3: Comparison of textual reward and execution rate changes between baseline and SFT models during the RL stage.
  • Figure 4: Comparison of reward gains during RL training with various reward settings.
  • Figure 5: Charts generated by MSRL compared with those from proprietary and open-source MLLMs. The charts produced by MSRL align well with the ground truth.
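
The captions for Figures 2 and 3 refer to executing generated plotting code and to an execution rate. One plausible way to measure that quantity is sketched below: each candidate script is run in a separate Python process and counts as a success if it exits with status 0 within a timeout. The helper names `runs_cleanly` and `execution_rate` and the 30-second timeout are hypothetical, and the paper's actual execution harness may differ.

```python
import subprocess
import sys
from typing import List


def runs_cleanly(code: str, timeout_s: float = 30.0) -> bool:
    """Return True if the script exits with status 0 within the timeout.

    Note: a child process is not a security sandbox. Untrusted generated
    code should be run inside a properly isolated environment.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def execution_rate(scripts: List[str]) -> float:
    """Fraction of scripts that run without error (0.0 for an empty list)."""
    if not scripts:
        return 0.0
    return sum(runs_cleanly(s) for s in scripts) / len(scripts)
```

For example, calling `execution_rate` on one valid script and one script that raises an exception yields 0.5 under this definition.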