Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Juan A. Rodriguez, Haotian Zhang, Abhay Puri, Aarash Feizi, Rishav Pramanik, Pascal Wichmann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli

TL;DR

This work tackles inverse rendering for SVG generation by closing the loop between generated SVG code and its rendered output. It introduces RLRF, an online reinforcement learning approach that uses a composite rendering-based reward to improve visual fidelity, semantic alignment, and code efficiency, following an initial supervised fine-tuning phase (SVG-SFT). Through extensive experiments on Im2SVG and Text2SVG, RLRF achieves state-of-the-art reconstruction and compactness on complex SVGs and demonstrates robust out-of-distribution generalization. The approach underscores the value of rendering feedback in structured, code-driven visual synthesis and offers a general framework for rendering-aware post-training across vector-graphics domains.

Abstract

Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.
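The render-and-compare loop described above can be sketched as follows. This is a minimal illustration, not the authors' reward: the function names, the `1 - MSE` form, the length cap `max_len`, and the weights `w_recon` and `w_len` are all assumptions made for the example. Rendering is assumed to happen elsewhere and to yield arrays with pixel values in [0, 1]. The semantic-alignment term is omitted because it would need an embedding model.

```python
import numpy as np


def reconstruction_reward(target, rendered):
    """Pixel-level fidelity in [0, 1]; 1.0 means the images are identical.

    Assumes both images have the same shape and pixel values in [0, 1].
    """
    target = np.asarray(target, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    mse = float(np.mean((target - rendered) ** 2))
    return 1.0 - mse


def length_reward(svg_code, max_len=2000):
    """Code-efficiency term in [0, 1]; shorter SVG code scores higher.

    `max_len` is a hypothetical cap, not a value taken from the paper.
    """
    return 1.0 - min(len(svg_code), max_len) / max_len


def composite_reward(target, rendered, svg_code, w_recon=0.8, w_len=0.2):
    """Weighted sum of the two terms; the weights are illustrative only."""
    return (w_recon * reconstruction_reward(target, rendered)
            + w_len * length_reward(svg_code))


def score_rollouts(target, rollouts):
    """Score a list of (svg_code, rendered_image) pairs for one input image."""
    return [composite_reward(target, image, code) for code, image in rollouts]
```

Because the reward is computed from rendered pixels and not from gradients through the renderer, it fits the non-differentiable setting the abstract describes: each roll-out receives a scalar score that an RL update can use.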

Paper Structure

This paper contains 65 sections, 6 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: RLRF Overview. We present an RL approach for inverse rendering code generation tasks, focused on SVG generation in VLMs. (Left) Given a text or image input, the model generates multiple SVG roll-outs, which are rendered and compared to the input to compute rewards based on reconstruction, semantics, and code efficiency. Non-differentiable steps (marked with stop signs) are handled through RL. (Right) A challenging out-of-distribution example with no ground truth SVG. While the base model (SVG-SFT) fails, RLRF enables progressive generalization, producing a meaningful SVG that captures key elements like shadows using gradients.
  • Figure 2: Im2SVG Reconstruction. Left: input pixel image. Right: rendered SVG predictions.
  • Figure 3: Text2SVG Generation. Left: input text. Right: generated SVG renderings.
  • Figure 4: Ablation on Sampling Temperature. Keeping the sampling temperature high is critical for promoting roll-out diversity. Test MSE, Reward, and SVG Length measurements consistently improve. We find that increasing the temperature up to 1.2 improves exploration, but values beyond this lead to unstable behavior and diverged outputs.
  • Figure 5: Ablation on the Number of Roll-outs. Increasing the number of roll-outs consistently improves MSE, reward, and SVG length. We report training curves, which offer clearer visibility by averaging over a large number of roll-outs.
  • ...and 11 more figures
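
The link between sampling temperature and roll-out diversity noted in Figure 4 can be illustrated with a generic temperature-scaled softmax. This is a sketch of the standard technique, not the paper's sampling code. The logits are made-up values, and entropy is used here only as a rough proxy for diversity.

```python
import numpy as np


def softmax_with_temperature(logits, temperature):
    """Turn logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())


# Hypothetical next-token logits, chosen only for illustration.
logits = [4.0, 2.0, 1.0, 0.5]
for t in (0.5, 1.0, 1.2):
    p = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token prob={p.max():.3f}, entropy={entropy(p):.3f}")
```

Raising the temperature flattens the distribution, so sampled roll-outs differ more from one another. The same flattening also gives low-probability tokens more weight, which is consistent with the instability the caption reports beyond 1.2.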