
CraftSVG: Multi-Object Text-to-SVG Synthesis via Layout Guided Diffusion

Ayan Banerjee, Nityanand Mathur, Josep Llados, Umapada Pal, Anjan Dutta

TL;DR

CraftSVG tackles the challenge of generating multi-object vector graphics from text by integrating a layout-guided diffusion pipeline with LLM-derived scene layouts, per-box latent initialization, and semantic-aware stroke abstraction. The four-stage framework combines layout correction, masked-latent canvas initialization, diffusion-guided region synthesis, and MLP-based abstraction with perceptual alignment and opacity modulation to produce coherent SVGs that preserve enumeration and spatial relations. Across extensive qualitative and quantitative evaluations, CraftSVG outperforms CLIP-based and diffusion-based baselines in realism, prompt fidelity, and stylistic control, while revealing an area for improvement in detailed human-face rendering. The work provides a training-free, scalable approach for multi-object vector graphics with practical applications in logos, posters, and simple architectural concepts.

Abstract

Generating VectorArt from text prompts is a challenging vision task that requires diverse yet realistic depictions of both seen and unseen entities. However, existing research has mostly been limited to generating single objects rather than comprehensive scenes comprising multiple elements. In response, this work introduces SVGCraft, a novel end-to-end framework for creating vector graphics that depict entire scenes from textual descriptions. Using a pre-trained LLM to generate layouts from text prompts, the framework introduces a technique for producing masked latents in specified bounding boxes for accurate object placement. It also introduces a fusion mechanism for integrating attention maps and employs a diffusion U-Net for coherent composition, which speeds up the drawing process. The resulting SVG is optimized using a pre-trained encoder and an LPIPS loss with opacity modulation to maximize similarity. Additionally, this work explores the potential of primitive shapes in facilitating canvas completion in constrained environments. Through both qualitative and quantitative assessments, SVGCraft is shown to surpass prior works in abstraction, recognizability, and detail, as evidenced by its performance metrics (CLIP-T: 0.4563, Cosine Similarity: 0.6342, Confusion: 0.66, Aesthetic: 6.7832). The code will be available at https://github.com/ayanban011/SVGCraft.
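
The idea of producing masked latents in specified bounding boxes and fusing them into one canvas can be illustrated with a toy sketch. This is not the authors' implementation: it uses a single-channel grid of floats in place of a real diffusion latent, omits the attention control entirely, and assumes boxes are given as normalized (x0, y0, x1, y1) tuples. The names `box_mask` and `fuse_latents` are hypothetical.

```python
import random

def box_mask(box, height, width):
    """Return a binary mask (list of rows): 1 inside the box, 0 elsewhere.

    `box` is (x0, y0, x1, y1) in normalized [0, 1] coordinates
    (an assumed convention, not taken from the paper).
    """
    x0, y0, x1, y1 = box
    c0, c1 = round(x0 * width), round(x1 * width)
    r0, r1 = round(y0 * height), round(y1 * height)
    return [[1 if (r0 <= r < r1 and c0 <= c < c1) else 0
             for c in range(width)]
            for r in range(height)]

def fuse_latents(background, box_latents, boxes):
    """Copy each per-box latent into the background wherever its mask is 1.

    Later boxes overwrite earlier ones where they overlap.
    """
    height, width = len(background), len(background[0])
    canvas = [row[:] for row in background]  # leave the background untouched
    for latent, box in zip(box_latents, boxes):
        mask = box_mask(box, height, width)
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    canvas[r][c] = latent[r][c]
    return canvas

# Toy usage: two objects on an 8x8 single-channel "latent".
rng = random.Random(0)
H = W = 8
background = [[0.0] * W for _ in range(H)]
boxes = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
box_latents = [[[rng.gauss(0.0, 1.0) for _ in range(W)] for _ in range(H)]
               for _ in boxes]
canvas = fuse_latents(background, box_latents, boxes)
```

In this sketch each object only influences the cells inside its own box, which is the property that lets a layout control where objects end up; cells outside every box keep the background value.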

Paper Structure (26 sections, 9 equations, 33 figures, 6 tables, 1 algorithm)

Figures (33)

  • Figure 1: Previous methods (DiffSketcher [xing2023diffsketcher], VectorFusion [jain2023vectorfusion]) vs. ours: while the former fail to produce the correct object counts, spatial relationships, and imaginary concepts, CraftSVG ensures all three through layout guidance. (Note: #strokes = 1024; stroke width = 4.0).
  • Figure 2: CraftSVG employs an LLM to generate layouts with a "background prompt", "grounding object" and their corresponding bounding boxes. Masked latents for each box, with controlled attention, ensure accurate object placement. These latents are fused to initialize the SVG canvas, which is used by a diffusion U-Net for coherent image generation ($\mathcal{I}_r$) that aligns with the layout. The final canvas is produced via two parallel-trained MLPs using perceptual alignment loss and opacity modulation, maximizing the similarity between $\mathcal{I}_r$ and $\mathcal{C}_\text{CLIPArt}$.
  • Figure 3: Iterative layout correction via a supportive error term.
  • Figure 4: SVGs with CraftSVG: the 1st row depicts the abstract SVG (#strokes = 64) obtained via the two parallel MLPs. The 2nd row shows the result with #strokes = 1024, obtained by optimizing opacity and semantic awareness. The 3rd row further optimizes the color to produce CLIPArt.
  • Figure 5: Stroke abstraction: increasing the number of layers and neurons enhances canvas control, detail, and aesthetics, while fewer neurons and layers maintain simplicity and recognizability (the number of strokes in MLP-based abstraction lies in the range [32, 128]).
  • ...and 28 more figures
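
The Figure 2 caption describes an LLM-generated layout made of a "background prompt" plus "grounding objects" with bounding boxes. A minimal sketch of how such a layout might be represented and sanity-checked is given below. The class and function names, the (x, y, w, h) pixel box convention, and the 512x512 canvas are all assumptions for illustration. The clamping step is a plain bounds check, not the iterative layout correction of Figure 3.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class GroundedObject:
    label: str   # e.g. "cat"
    box: tuple   # (x, y, w, h) in pixels; assumed convention

@dataclass
class Layout:
    background_prompt: str
    objects: list = field(default_factory=list)

def clamp_box(box, canvas_w, canvas_h):
    """Shrink a box so that it lies fully inside the canvas."""
    x, y, w, h = box
    x = min(max(x, 0), canvas_w)
    y = min(max(y, 0), canvas_h)
    w = max(min(w, canvas_w - x), 0)
    h = max(min(h, canvas_h - y), 0)
    return (x, y, w, h)

def count_labels(layout):
    """Count objects per label, e.g. to compare against the counts in a prompt."""
    return Counter(obj.label for obj in layout.objects)

# Toy usage for a prompt such as "two cats and a dog in a garden".
layout = Layout(
    background_prompt="a garden",
    objects=[
        GroundedObject("cat", (20, 300, 150, 150)),
        GroundedObject("cat", (200, 300, 150, 150)),
        GroundedObject("dog", (400, 400, 200, 200)),  # spills off the canvas
    ],
)
layout.objects = [
    GroundedObject(o.label, clamp_box(o.box, 512, 512)) for o in layout.objects
]
counts = count_labels(layout)
```

Keeping the labels and boxes in an explicit structure makes it straightforward to check object counts and box placement before any image generation is attempted.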