IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework

Feiyu Wang, Jiayuan Yang, Zhiyuan Zhao, Da Zhang, Bingyu Li, Peng Liu, Junyu Gao

TL;DR

Experimental results demonstrate that the Introspective SVG Generation Framework achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability.

Abstract

Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advancements in content generation enabled by Visual Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality. To address this limitation, we propose an Introspective SVG Generation Framework (IntroSVG). At its core, the framework instantiates a unified VLM that operates in a closed loop, assuming dual roles of both generator and critic. Specifically, through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and to provide feedback on their rendered outputs; moreover, we systematically convert early-stage failures into high-quality error-correction training data, thereby enhancing model robustness. Subsequently, we leverage a high-capacity teacher VLM to construct a preference dataset and further align the generator's policy through Direct Preference Optimization (DPO). During inference, the optimized generator and critic operate collaboratively in an iterative "generate-review-refine" cycle, starting from imperfect intermediate drafts to autonomously improve output quality. Experimental results demonstrate that our method achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. These results corroborate the effectiveness of incorporating explicit visual feedback into the generation loop.

Paper Structure (42 sections, 3 equations, 11 figures, 9 tables)


Figures (11)

  • Figure 1: Overview of our proposed IntroSVG (Introspective SVG Generation) framework. (Left) At the core is a unified VLM that fulfills the dual roles of "Generator" (drafting SVG) and "Critic" (perceiving PNG feedback). Black arrows represent the initial generation, while green arrows denote the iterative optimization. (Right) The "generate-critique-refine" iterative loop is shown: the model generates an initial draft, self-critiques the rendered PNG, and finally revises the code based on the structured feedback. (Bottom) Visualizations demonstrate how the model autonomously improves a sketch into a high-quality SVG through iterative refinement.
  • Figure 2: Overview of the IntroSVG framework. The method proceeds in the following stages. Data Construction: synthesize a mixed dataset for direct generation ($D_G^{\text{direct}}$), correction ($D_G^{\text{correction}}$), and critique ($D_C$) using an early-checkpoint model and a Teacher VLM. Stage 1 (SFT): train a unified VLM on this mixed dataset so that it acquires both generation and critique capabilities. Stage 2 (DPO): use the Teacher VLM to evaluate generated preference pairs and optimize the model's generation policy ($M_{\text{Policy}}$) via the DPO loss. Introspective Inference Loop: at inference time, the final single model runs a closed loop: it first generates an SVG, then switches to the Critic role to "view" the rendering and assign a score. If the score is unsatisfactory, it uses the critique for the next round of correction.
  • Figure 3: Qualitative comparison between the proposed IntroSVG and other text-to-SVG methods.
  • Figure 4: Qualitative results of the iterative refinement loop.
  • Figure 5: Visual and code comparison before and after data standardization.
  • ...and 6 more figures
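
The introspective inference loop described in the Figure 2 caption (generate an SVG, render it, score it as a critic, and refine if the score is unsatisfactory) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Critique` type, the `generate`, `render`, and `critique` callables, the `threshold` and `max_rounds` parameters, and the choice to return the best-scoring draft are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Critique:
    """Hypothetical critic output: a scalar score plus textual feedback."""
    score: float
    feedback: str


def introspective_loop(
    prompt: str,
    # generate(prompt, previous_svg, feedback) -> SVG code; hypothetical signature
    generate: Callable[[str, Optional[str], Optional[str]], str],
    # render(svg_code) -> rasterized image bytes; hypothetical signature
    render: Callable[[str], bytes],
    # critique(prompt, image_bytes) -> Critique; hypothetical signature
    critique: Callable[[str, bytes], Critique],
    threshold: float = 8.0,   # assumed acceptance score, not from the paper
    max_rounds: int = 3,      # assumed cap on refinement rounds
) -> Tuple[str, float]:
    """Run a generate-review-refine cycle and return (best_svg, best_score)."""
    svg = generate(prompt, None, None)  # initial draft
    best_svg, best_score = svg, float("-inf")

    for round_idx in range(max_rounds + 1):
        # Critic role: look at the rendering, not the code.
        review = critique(prompt, render(svg))
        if review.score > best_score:
            best_svg, best_score = svg, review.score

        # Stop when the draft is good enough or the round budget is spent.
        if review.score >= threshold or round_idx == max_rounds:
            break

        # Generator role: revise the draft using the critic's feedback.
        svg = generate(prompt, svg, review.feedback)

    return best_svg, best_score
```

In the framework as described, the generator and critic are two roles of one unified VLM, so in practice `generate` and `critique` would both wrap the same model under different prompts. Keeping the best-scoring draft is a defensive choice made for this sketch, since a refinement round is not guaranteed to improve the score.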