VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning

Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, Lin Ma

TL;DR

VinciCoder introduces a unified multimodal code-generation framework trained with a two-stage SFT-ViRL workflow to handle visual inputs and generate executable code across diverse domains. The core novelty is a coarse-to-fine visual reinforcement learning objective that uses DINOv2-L embeddings and a GRPO-based policy update, coupled with a language-alignment reward to ensure prompt-language fidelity. Large-scale SFT data (1.6M samples) establishes a robust foundation, while the ViRL stage optimizes both executability and visual fidelity, achieving state-of-the-art results among open-source models on multimodal code generation benchmarks and generalizing across chart-to-code, web-to-HTML, image-to-SVG, and image-to-LaTeX tasks. Ablations support the contribution of the refinement data, the coarse-to-fine reward, and the two-stage training paradigm, highlighting the model’s potential for domain-agnostic multimodal code synthesis.
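
To make the "GRPO-based policy update" concrete, here is a minimal sketch of the group-relative advantage step that GRPO is generally known for: the rewards of all rollouts sampled for one prompt are normalized by that group's mean and standard deviation. The function name, the `eps` constant, and the reward values are illustrative assumptions, not taken from the paper; the paper's exact objective (clipping, KL terms, reward weighting) is not reproduced here.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps).

    `eps` is an assumed small constant that avoids division by zero
    when every rollout in the group receives the same reward.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


# Made-up rewards for a group of 8 rollouts of the same prompt.
# The 0.0 entry stands for a rollout whose code failed to render.
rewards = [0.62, 0.70, 0.55, 0.81, 0.00, 0.66, 0.73, 0.59]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f}  advantage={advantage:+.3f}")
```

Rollouts that score above the group mean get a positive advantage and are reinforced; those below it are pushed down, so no separate value network is needed to provide a baseline.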

Abstract

Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of generalized VIsioN Code Intelligence. In this work, we introduce VinciCoder, a unified multimodal code generation model that addresses this limitation via a two-stage training framework. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs for tasks involving direct code generation and visual-based code refinement. Subsequently, we introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity by calculating visual similarity across local and global image patches. Extensive experiments on diverse multimodal code generation benchmarks demonstrate that VinciCoder achieves state-of-the-art performance, surpassing recent open-source models. The ablation study further validates the effectiveness of our proposed coarse-to-fine ViRL strategy. The data, code, and model are available at https://github.com/DocTron-hub/VinciCoder.

Paper Structure

This paper contains 28 sections, 12 equations, 11 figures, 7 tables, and 1 algorithm.

Figures (11)

  • Figure 1: VinciCoder is a unified multimodal code generation model built upon the QwenVL series via a two-stage SFT-ViRL training strategy. This approach enables VinciCoder to process visual inputs and generate corresponding code snippets.
  • Figure 2: Our training dataset is constructed via a multi-stage pipeline. We begin by curating a diverse corpus from open-source datasets, employing rigorous filtering and diversity-aware sampling. Subsequently, we enhance the data via two parallel streams: refining existing samples through execution, validation, and optimization, while generating novel ones for the refinement task. This dual strategy yields the final high-quality data pairs for our SFT and RL training.
  • Figure 3: An overview of our coarse-to-fine ViRL strategy. Given an image with instructions, the model generates 8 code rollouts. Each code snippet is first evaluated for a language alignment reward and then rendered into an image. This image is partitioned into local patches (fine-grained) and a downsampled global thumbnail (coarse-grained). The final visual reward is the average cosine similarity between the DINOv2 embeddings of these rendered components and their counterparts from the target image.
  • Figure 4: The reward progression during our ViRL training stage. The learning curves illustrate that as training progresses, the visual reward steadily increases, while the alignment reward rapidly converges to and then plateaus at its maximum value of 1.
  • Figure 5: Ablation study of the SFT and RL training stages.
  • ...and 6 more figures
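
The coarse-to-fine visual reward described in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical stand-in that flattens pixels instead of calling a DINOv2 encoder, the patch size and downsampling factor are arbitrary, the thumbnail is produced by plain strided subsampling, and every component is weighted equally in the average. Rendering code to an image and the language-alignment reward are omitted.

```python
import numpy as np


def toy_embed(img):
    """HYPOTHETICAL stand-in for a DINOv2 encoder: flattens pixels to a vector."""
    return np.asarray(img, dtype=np.float64).ravel()


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors (eps guards against zero norms)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def split_patches(img, patch):
    """Cut the image into non-overlapping patch x patch tiles (fine-grained view).

    Edge pixels that do not fill a whole tile are dropped in this sketch.
    """
    h, w = img.shape[:2]
    return [
        img[y:y + patch, x:x + patch]
        for y in range(0, h - patch + 1, patch)
        for x in range(0, w - patch + 1, patch)
    ]


def thumbnail(img, factor):
    """Downsample by strided subsampling (coarse-grained view)."""
    return img[::factor, ::factor]


def coarse_to_fine_reward(rendered, target, patch=8, factor=4, embed=toy_embed):
    """Average cosine similarity over all local patches plus one global thumbnail."""
    if rendered.shape != target.shape:
        raise ValueError("rendered and target images must share a shape")
    pairs = list(zip(split_patches(rendered, patch), split_patches(target, patch)))
    pairs.append((thumbnail(rendered, factor), thumbnail(target, factor)))
    sims = [cosine(embed(r), embed(t)) for r, t in pairs]
    return sum(sims) / len(sims)


# Synthetic grayscale "images" stand in for the target and a rendered rollout.
rng = np.random.default_rng(0)
target = rng.random((16, 16))
rendered = rng.random((16, 16))

print("identical image:", coarse_to_fine_reward(target, target))
print("different image:", coarse_to_fine_reward(rendered, target))
```

In a real pipeline the `embed` argument would wrap the visual encoder, and each of the rollouts generated for a prompt would be rendered and scored this way before the policy update.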