
Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models

Ronghuan Wu, Wanchao Su, Jing Liao

TL;DR

Chat2SVG tackles the challenge of producing high-quality, semantically regular SVGs from text by combining Large Language Models for template generation with image diffusion models for geometry refinement. The method introduces an SVG-oriented prompt design, a dual-stage optimization in latent and point spaces, and an iterative natural-language editing loop to maintain semantic and visual coherence. Empirical results show superior visual fidelity, path regularity, and text alignment versus strong baselines, supported by a user study favoring Chat2SVG outputs. The work has practical impact by enabling accessible professional vector graphics creation and interactive editing, with promising avenues for further enhancement and extension to related vector formats.

Abstract

Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite these advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.
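
To make the idea of an "SVG template from basic geometric primitives" concrete, the following is a minimal sketch of what such a template might look like when serialized. It is not the authors' prompt design or output format: the `build_template` helper, the example prompt, and the element ids (`sky`, `sun`, `hill`) are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def build_template(shapes, size=512):
    """Serialize a list of (tag, attributes) primitives into an SVG string.

    Hypothetical helper: each primitive keeps a readable id so that every
    shape maps to one semantic part of the scene.
    """
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "viewBox": f"0 0 {size} {size}",
    })
    for tag, attrs in shapes:
        # SVG attribute values are strings, so convert numbers explicitly.
        ET.SubElement(svg, tag, {k: str(v) for k, v in attrs.items()})
    return ET.tostring(svg, encoding="unicode")


# Assumed example for the prompt "a sun over a hill" (not from the paper).
shapes = [
    ("rect", {"id": "sky", "x": 0, "y": 0, "width": 512, "height": 512, "fill": "#87ceeb"}),
    ("circle", {"id": "sun", "cx": 380, "cy": 130, "r": 60, "fill": "#ffd700"}),
    ("ellipse", {"id": "hill", "cx": 256, "cy": 520, "rx": 360, "ry": 180, "fill": "#3cb371"}),
]
template = build_template(shapes)
print(template)
```

A template of this kind is coarse by design: each element is a simple primitive with a clear meaning, and geometric detail would be added by the later optimization stages described in the abstract.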

Paper Structure

This paper contains 19 sections, 5 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: SVG examples generated by our Chat2SVG. We highlight some shapes to demonstrate semantic clarity and path quality.
  • Figure 2: The system pipeline of Chat2SVG. Given a text prompt, our system first leverages an LLM to generate an SVG template composed of basic geometric primitives. The rendered template is enhanced through SDEdit [meng2021sdedit] with ControlNet [zhang2023adding] to add visual details while preserving the overall composition, yielding a target image. The SVG then undergoes a dual-stage optimization process to match the target image. (1) Primitives are converted to latent embeddings through latent inversion and optimized along with their visual attributes (i.e., filling colors $c_i$, stroke properties $s_i$, and transformation matrices $T_i$). (2) Point-level optimization is performed to refine the geometric details of SVG paths.
  • Figure 3: Qualitative comparison. (1) Methods that refine open-ended strokes, i.e., CLIPDraw [frans2021clipdraw] and DiffSketcher [xing2023diffsketcher], often produce distorted and disorganized strokes to approximate objects, resulting in a messy appearance and poor text alignment. (2) VectorFusion [jain2022vectorfusion] and SVGDreamer [xing2024svgdreamer] produce elements that consist of multiple jagged, irregular, and fragmented shapes, such as the body of the flamingo (first row) and the plane (third row). (3) T2V-NPR [zhang2024text] attempts to resolve these issues by learning a latent representation of paths and merging fragmented shapes. However, it still cannot guarantee that each path carries a clear semantic meaning, leading to semantically ambiguous paths such as a plane body with surrounding clouds in the third row. In contrast, our method produces SVGs with superior text alignment, higher visual quality, and well-structured paths exhibiting geometric regularity and clear semantic definition.
  • Figure 4: User Study. Our Chat2SVG achieves the highest user selection ratio across all three evaluation criteria.
  • Figure 5: Qualitative results of the ablation study.
  • ...and 3 more figures
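
The Figure 2 caption lists per-shape attributes that are optimized together with the geometry: a fill color $c_i$, stroke properties $s_i$, and a transformation matrix $T_i$. As a rough illustration of how such a transformation acts on path points, the sketch below applies a 3×3 affine matrix in homogeneous coordinates. The `Shape` class and its field names are assumptions made for this example, the matrix layout follows the standard SVG `matrix(a, b, c, d, e, f)` convention, and the optimization itself is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
Matrix = Tuple[Tuple[float, float, float], ...]

# 3x3 identity in homogeneous coordinates: leaves points unchanged.
IDENTITY: Matrix = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


@dataclass
class Shape:
    """Hypothetical container for one shape i and its visual attributes."""
    points: List[Point]           # path control points
    fill: str = "#000000"         # stands in for the fill color c_i
    stroke: str = "none"          # stands in for the stroke properties s_i
    transform: Matrix = IDENTITY  # stands in for the transformation T_i

    def transformed_points(self) -> List[Point]:
        # Rows are (a, c, e) and (b, d, f), matching SVG matrix(a,b,c,d,e,f):
        # x' = a*x + c*y + e,  y' = b*x + d*y + f
        (a, c, e), (b, d, f), _ = self.transform
        return [(a * x + c * y + e, b * x + d * y + f) for x, y in self.points]


# Scale by 2, then translate by (10, 20).
shape = Shape(
    points=[(0.0, 0.0), (1.0, 1.0)],
    fill="#ffd700",
    transform=((2.0, 0.0, 10.0), (0.0, 2.0, 20.0), (0.0, 0.0, 1.0)),
)
print(shape.transformed_points())  # [(10.0, 20.0), (12.0, 22.0)]
```

Keeping the transformation separate from the control points means a shape can be moved or scaled as a whole without touching its individual coordinates, which is one plausible reason to treat $T_i$ as its own attribute.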