VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning
Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, Lin Ma
TL;DR
VinciCoder introduces a unified multimodal code-generation framework trained with a two-stage workflow, supervised fine-tuning (SFT) followed by visual reinforcement learning (ViRL), to handle visual inputs and generate executable code across diverse domains. The core novelty is a coarse-to-fine visual reinforcement learning objective that uses DINOv2-L embeddings and a GRPO-based policy update, coupled with a language-alignment reward to ensure prompt-language fidelity. Large-scale SFT data (1.6M samples) establishes a robust foundation, while the ViRL stage optimizes both executability and visual fidelity, achieving state-of-the-art results among open-source models and demonstrating strong generalization across chart-to-code, web-to-HTML, image-to-SVG, and image-to-LaTeX tasks. Extensive ablations validate the necessity of the refinement data, the coarse-to-fine reward, and the two-stage training paradigm, highlighting the model’s potential for domain-agnostic multimodal code synthesis.
Abstract
Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of generalized VIsioN Code Intelligence. In this work, we introduce VinciCoder, a unified multimodal code generation model that addresses this limitation via a two-stage training framework. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs for tasks involving direct code generation and visual-based code refinement. Subsequently, we introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity by calculating visual similarity across local and global image patches. Extensive experiments on diverse multimodal code generation benchmarks demonstrate that VinciCoder achieves state-of-the-art performance, surpassing recent open-source models. The ablation study further validates the effectiveness of our proposed coarse-to-fine ViRL strategy. The data, code, and model are available at https://github.com/DocTron-hub/VinciCoder.
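To make the coarse-to-fine idea concrete, the following is a minimal sketch of a reward that combines a global (whole-image) similarity with the mean similarity over local patches. It is an illustration under stated assumptions, not the released implementation: `embed` is a hypothetical stand-in for a visual encoder such as DINOv2-L (here it simply average-pools pixels), and the grid size, the equal weighting `w_global`, and all function names are assumptions chosen for the example.

```python
import numpy as np


def embed(image: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a visual encoder (e.g. DINOv2-L).

    Average-pools the image onto a 4x4 grid and flattens the result.
    The image must be at least 4 pixels in height and width.
    """
    gh, gw = 4, 4
    h, w = image.shape[:2]
    # Crop so that height and width are divisible by the pooling grid.
    image = image[: h - h % gh, : w - w % gw].astype(np.float64)
    h, w = image.shape[:2]
    pooled = image.reshape(gh, h // gh, gw, w // gw, -1).mean(axis=(1, 3))
    return pooled.ravel()


def cosine(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Cosine similarity; eps guards against division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def split_patches(image: np.ndarray, grid: int) -> list:
    """Split an image into grid x grid non-overlapping local patches."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    return [
        image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
        for i in range(grid)
        for j in range(grid)
    ]


def coarse_to_fine_reward(rendered, target, grid=2, w_global=0.5):
    """Blend global (coarse) and patch-level (fine) visual similarity.

    Assumes both images have the same shape. The weighted average is an
    assumption of this sketch, not a detail taken from the paper.
    """
    coarse = cosine(embed(rendered), embed(target))
    fine = np.mean([
        cosine(embed(p), embed(q))
        for p, q in zip(split_patches(rendered, grid),
                        split_patches(target, grid))
    ])
    return w_global * coarse + (1.0 - w_global) * float(fine)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.uniform(0, 255, size=(64, 64, 3))
    rendered = target.copy()
    rendered[:32, :32] = 0.0  # simulate one incorrectly rendered region
    print(coarse_to_fine_reward(target, target))
    print(coarse_to_fine_reward(rendered, target))
```

In this sketch, `rendered` would be the image produced by executing the generated code and `target` the input image. The patch-level term penalizes a localized rendering error more sharply than the global term alone, which is the motivation for scoring at both scales.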
