MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction

Zitian Tang, Xu Zhang, Jianbo Yuan, Yang Zou, Varad Gunjal, Songyao Jiang, Davide Modolo

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated promising capabilities in multimodal coding tasks such as chart-to-code generation. However, existing methods primarily rely on supervised fine-tuning (SFT), which requires the model to learn code patterns from chart-code pairs but never exposes it to a code execution environment. Moreover, while self-correction through execution feedback offers a potential route to improving coding quality, even state-of-the-art MLLMs have been shown to struggle with effective self-correction. In this work, we introduce MM-ReCoder, a chart-to-code generation model trained with reinforcement learning (RL) and equipped with self-correction capability. We propose a two-stage multi-turn self-correction RL strategy based on Group Relative Policy Optimization (GRPO). The first stage enhances the model's self-correction ability by rolling out a shared first turn, while the second stage improves coding capability with full-trajectory optimization. MM-ReCoder learns to produce more accurate and executable code through interaction with the execution environment and by iteratively correcting its own outputs. Results on three chart-to-code benchmarks demonstrate the state-of-the-art performance of MM-ReCoder.
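
The two-stage strategy builds on GRPO, which scores each rollout against its group's mean reward rather than against a learned value function. The snippet below is a minimal sketch of the standard group-relative advantage applied to a toy group of second-turn corrections that share one first-turn draft, as in the first stage; the grpo_advantages helper and the reward values are illustrative assumptions, not the paper's implementation.

import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    # Standard GRPO advantage: normalize each rollout's reward by the
    # mean and standard deviation of its group.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0.0:
        # All rewards equal: no relative signal for this group.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Toy usage: rewards of four second-turn corrections that all share the
# same first-turn draft, as in the stage-one shared-first-turn rollout.
print(grpo_advantages([0.2, 0.8, 0.5, 0.5]))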

Paper Structure

This paper contains 34 sections, 5 equations, 13 figures, and 12 tables.

Figures (13)

  • Figure 1: Training pipeline of MM-ReCoder. We conduct two stages of cold start: (a) we first train the model on ground-truth chart-code pairs with SFT, then (b) we construct self-correction data with Qwen3VL-235B [qwen3vl_blog], filter the successful trajectories, and train our model on the filtered data. After cold start, we conduct two stages of reinforcement learning: (c) we first enhance the model's second-turn self-correction capability via shared-first-turn optimization, then (d) we optimize the two turns jointly to improve coding ability (see the execution-feedback sketch after this list).
  • Figure 2: Result of a model trained solely with the rule-based reward. The model receives the full rule-based reward even though the text labels overlap, whereas the model-based reward can penalize this chart.
  • Figure 3: Qualitative results comparing ground truth, $1^{st}$-round generation, and $2^{nd}$-round self-correction (left to right).
  • Figure A1: Training curves of MM-ReCoder. Curves on the training set are smoothed with a window size of 20 steps. We evaluate the model on ChartMimic every 40 steps. Note that the rule-based reward may differ slightly from the reported low-level score because the latter is computed with the official ChartMimic evaluation codebase.
  • Figure A2: Qualitative result of MM-ReCoder for self-correction. The model removes the text labels from the pie chart in the second turn.
  • ...and 8 more figures
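
The self-correction described in Figures 1 and A2 relies on executing the generated code and feeding the outcome back to the model for a second turn. Below is a minimal, hypothetical sketch of such an execution-feedback step; the function name and harness details are assumptions, since the paper's actual sandbox is not specified here.

import os
import subprocess
import tempfile

def execute_chart_code(code: str, timeout: int = 30) -> tuple[bool, str]:
    # Write the generated matplotlib script to a temp file and run it in a
    # subprocess; on failure, return the captured traceback so it can be
    # appended to the prompt for the second, self-correcting turn.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=timeout
        )
        return result.returncode == 0, result.stderr
    except subprocess.TimeoutExpired:
        return False, f"Execution timed out after {timeout}s"
    finally:
        os.unlink(path)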