SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion
Ximing Xing, Juncheng Hu, Jing Zhang, Dong Xu, Qian Yu
TL;DR
SVGFusion addresses scalable Text-to-SVG generation in two stages: a Vector-Pixel Fusion VAE (VP-VAE) learns a continuous latent space for vector graphics, and a Vector Space Diffusion Transformer (VS-DiT) performs text-conditioned diffusion in that space. A rendering sequence modeling strategy aligns the vector and raster construction process with human design logic, improving both visual quality and editability. The model is trained and evaluated on SVGX-Dataset, a large, cleaned collection of ~240k high-quality SVGs, and outperforms existing optimization-based and language-model-based methods across multiple metrics; it can be scaled up further by adding more VS-DiT blocks. The work also broadens SVG primitive support and contributes a robust data pipeline, yielding an editable, diffusion-based framework aimed at practical vector-graphics generation on real-world data.
Abstract
In this work, we introduce SVGFusion, a Text-to-SVG model capable of scaling to real-world SVG data without relying on text-based discrete language models or prolonged Score Distillation Sampling (SDS) optimization. The core idea of SVGFusion is to use a popular Text-to-Image framework to learn a continuous latent space for vector graphics. Specifically, SVGFusion comprises two key modules: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) and a Vector Space Diffusion Transformer (VS-DiT). The VP-VAE processes both SVG codes and their corresponding rasterizations to learn a continuous latent space, while the VS-DiT generates latent codes within this space based on the input text prompt. Building on the VP-VAE, we propose a novel rendering sequence modeling strategy that enables the learned latent space to capture the inherent creation logic of SVGs. This allows the model to generate SVGs with higher visual quality and more logical construction, while systematically avoiding occlusion in complex graphic compositions. Additionally, SVGFusion can be scaled up by adding more VS-DiT blocks. To effectively train and evaluate SVGFusion, we construct SVGX-Dataset, a large-scale, high-quality SVG dataset that addresses the scarcity of high-quality vector data. Extensive experiments demonstrate the superiority of SVGFusion over existing SVG generation methods, establishing a new framework for SVG content creation. Code, model, and data will be released at: https://ximinng.github.io/SVGFusionProject/
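
To make the idea of diffusion in a learned latent space concrete, the following is a minimal sketch of the forward noising step that a latent diffusion model typically applies during training. It is not the authors' implementation. It assumes a DDPM-style linear noise schedule, which the abstract does not specify; the function names, the schedule constants, and the 8-dimensional latent are hypothetical stand-ins chosen only for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, DDPM-style schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def add_noise(z0, t, alpha_bars, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)
# Hypothetical stand-in for a latent code produced by a VAE encoder.
z0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
# At the final step the latent is almost pure Gaussian noise.
zt, eps = add_noise(z0, 999, alpha_bars, rng)
```

In a full system of this kind, a denoising network would be trained to recover `eps` from `zt`, the step index, and a text embedding; that network and the VAE decoder are outside the scope of this sketch.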
