SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion

Ximing Xing, Juncheng Hu, Jing Zhang, Dong Xu, Qian Yu

TL;DR

SVGFusion tackles scalable Text-to-SVG generation by learning a continuous latent space for vector graphics with a Vector-Pixel Fusion VAE (VP-VAE) and running diffusion in that space with a Vector Space Diffusion Transformer (VS-DiT) conditioned on text prompts. A rendering sequence modeling strategy aligns the vector and raster construction process with human design logic, improving both visual quality and editability. The approach is trained and evaluated on SVGX-Dataset, a large, cleaned collection of ~240k high-quality SVGs, and outperforms existing optimization- and language-model-based methods across multiple metrics; it can be scaled further by adding more VS-DiT blocks. The work contributes broader SVG primitive support, a robust data pipeline, and a diffusion-based, editable framework aimed at practical vector-graphics generation at real-world scale.

Abstract

In this work, we introduce SVGFusion, a Text-to-SVG model capable of scaling to real-world SVG data without relying on text-based discrete language models or prolonged Score Distillation Sampling (SDS) optimization. The core idea of SVGFusion is to utilize a popular Text-to-Image framework to learn a continuous latent space for vector graphics. Specifically, SVGFusion comprises two key modules: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) and a Vector Space Diffusion Transformer (VS-DiT). The VP-VAE processes both SVG codes and their corresponding rasterizations to learn a continuous latent space, while the VS-DiT generates latent codes within this space based on the input text prompt. Building on the VP-VAE, we propose a novel rendering sequence modeling strategy which enables the learned latent space to capture the inherent creation logic of SVGs. This allows the model to generate SVGs with higher visual quality and more logical construction, while systematically avoiding occlusion in complex graphic compositions. Additionally, the scalability of SVGFusion can be continuously enhanced by adding more VS-DiT blocks. To effectively train and evaluate SVGFusion, we construct SVGX-Dataset, a large-scale, high-quality SVG dataset that addresses the scarcity of high-quality vector data. Extensive experiments demonstrate the superiority of SVGFusion over existing SVG generation methods, establishing a new framework for SVG content creation. Code, model, and data will be released at: https://ximinng.github.io/SVGFusionProject/

Paper Structure

This paper contains 21 sections, 2 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Example SVGs generated by our SVGFusion. Our proposed method, SVGFusion, can generate SVGs with (a) reasonable construction, (b) a clear and systematic layering structure, and (c) high editability.
  • Figure 2: An Overview of SVGFusion. (a) The pipeline begins with the representation of SVGs, where XML-defined SVG code is converted into an SVG embedding (Sec. \ref{sec:neural_svg}). (b) We first train a Vector-Pixel Fusion Variational Autoencoder (VP-VAE, Sec. \ref{sec:vp_vae}) with a transformer-based architecture to learn a continuous latent space for SVGs by incorporating features from both SVG codes and their rendered images. (c) The Vector Space Diffusion Transformer (VS-DiT, Sec. \ref{sec:dit}) is then trained within the learned latent space to generate new latent codes conditioned on input text descriptions.
  • Figure 3: Illustration of the SVG embedding process. SVG code is initially converted into a matrix representation that includes geometric attributes, colors, and opacity. This matrix is subsequently mapped into a tensor via SVG embeddings.
  • Figure 4: Illustration of the Vector-Pixel Fusion Encoding. The VP-VAE encoder integrates the SVG embeddings ($Q$) with pixel embeddings ($K$, $V$) using a cross-attention layer. After processing through $L$ self-attention layers, the encoded features are mapped to a latent space, where the mean and standard deviation are computed for a probabilistic representation. A latent variable $\bm{z}$ is sampled using the reparameterization trick and then passed to the decoder for further processing. For clarity, the batch dimension $B$ has been omitted.
  • Figure 5: Qualitative Comparison of SVGFusion and Existing Text-to-SVG Methods. The target SVGs are in the emoji style. We use prompt modifiers for the optimization-based approach to encourage the appropriate style: "minimal flat 2D vector icon, emoji icon, lineal color, on a white background, trending on ArtStation." Note that although the visual quality of results generated by optimization-based methods is high, these methods face challenges in decomposing the SVGs for further editing.
  • ...and 12 more figures
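
The fusion step described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the token counts, feature width, random inputs, and the constant standard deviation are all hypothetical placeholders, the learned projection layers and the $L$ self-attention layers are omitted, and only the two operations named in the caption are shown (cross-attention with SVG embeddings as queries and pixel embeddings as keys and values, followed by the reparameterization trick).

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q, k, v):
    """Scaled dot-product attention: queries attend over keys, then mix values."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]           # transpose keys to (d, n_keys)
    scores = matmul(q, k_t)                        # (n_queries, n_keys)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps elementwise, with eps drawn from N(0, 1)."""
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(m_row, s_row)]
            for m_row, s_row in zip(mu, sigma)]

rng = random.Random(0)
n_svg, n_pix, d = 4, 6, 8                          # hypothetical sizes, not from the paper

# Random stand-ins for the SVG embeddings (Q) and pixel embeddings (K, V).
svg_tokens = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_svg)]
pix_tokens = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_pix)]

# Each SVG token gathers information from every pixel token.
fused, weights = cross_attention(svg_tokens, pix_tokens, pix_tokens)

# Assumption: the fused features act as the mean, with a fixed standard deviation.
# In a trained VAE both would come from learned output heads.
mu = fused
sigma = [[0.1] * d for _ in range(n_svg)]
z = reparameterize(mu, sigma, rng)

print(len(z), len(z[0]))
```

Because the queries come from the SVG side, the output keeps one row per SVG token, so the latent sequence length follows the vector representation rather than the image patch count.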