NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation

Vikas Thamizharasan, Difan Liu, Matthew Fisher, Nanxuan Zhao, Evangelos Kalogerakis, Michal Lukac

TL;DR

NIVeL reframes text-to-vector generation by using layered neural implicit fields as a stable, topology-flexible intermediate representation. The model learns per-layer occupancy and color via Score Distillation Sampling guided by a pre-trained image diffusion model, with a two-stage initialization that emphasizes low-frequency structure before SDS fine-tuning. It yields editable, layer-decomposed vector graphics that can be converted to Bézier curves, achieving significantly higher vector quality than prior work like VectorFusion, as shown by both CLIP-based metrics and user studies. This approach enables scalable, text-guided vector generation with practical convergence times and suggests a viable pathway for diffusion-based vector synthesis through neural implicit intermediates.

Abstract

The success of denoising diffusion models in representing rich data distributions over 2D raster images has prompted research on extending them to other data representations, such as vector graphics. Unfortunately, due to the variable structure of vector graphics and the scarcity of vector training data, directly applying diffusion models to this domain remains a challenging problem. Workarounds such as optimization via Score Distillation Sampling (SDS) are also fraught with difficulty, as vector representations are non-trivial to optimize directly and tend to result in implausible geometries such as redundant or self-intersecting shapes. NIVeL addresses these challenges by reinterpreting the problem on an alternative, intermediate domain which preserves the desirable properties of vector graphics, mainly sparsity of representation and resolution-independence. This alternative domain is based on neural implicit fields expressed in a set of decomposable, editable layers. Based on our experiments, NIVeL produces text-to-vector graphics results of significantly better quality than the state-of-the-art.

Paper Structure (35 sections, 8 equations, 14 figures, 2 tables)


Figures (14)

  • Figure 1: CLIP metrics computed with the clip-vit-large-patch14 pre-trained model on rasterized SVG results of NIVeL and VectorFusion [jain2022vectorfusion] under our two parameter settings. Both methods are optimized with DeepFloyd [deepfloyd]. We also show the official reported results from VectorFusion when optimized with Stable Diffusion (row "with SD"). We note that their implementation is not available, so the number of parameters in their experiments is unknown.
  • Figure 2: Sampling raster images from diffusion models and then applying vectorization leads to implausible geometries, redundant curves, and semantically meaningless layers. Here we show the results of sampling a text-to-image diffusion model, DeepFloyd [deepfloyd], then applying LIVE [Du:2023:IVE], a vectorizer that produces layer-decomposed SVGs. The sampled raster images often contain complex signals that are difficult to vectorize and interpret.
  • Figure 3: NIVeL's architecture: given points $\mathbf{p}$ on a 2D unit square domain, our method incorporates an MLP-based network with parameters ${\boldsymbol{\psi}}$ that predicts a set of implicit fields on this domain, each representing a geometric shape. In addition, it predicts per-shape colors $\mathbf{c}$. The representation is continuous, resolution-independent, and can easily be converted to parametric curve formats (e.g., Bézier curves). To estimate the parameters, our method uses SDS-based optimization driven by a raster diffusion model conditioned on an input text prompt.
  • Figure 4: Given an input sampled image (left), we estimate NIVeL's parameters through an L2 reconstruction loss without the entropy term $L_{entr}$ (middle), or with it (right). The entropy term results in a cleaner shape with delineated boundaries.
  • Figure 5: Text-to-Vector Graphics generation results. We compare generated SVG results for the input text prompt between NIVeL (ours) and VectorFusion, at two settings involving $1K$ or $12K$ parameters. Our vector results contain much cleaner shape geometry across diverse topologies, while VectorFusion's SVGs contain redundant, degenerate curves and self-intersecting shapes. Our method also remains robust at low capacity (1K parameters), faithfully capturing the abstraction of the concepts in the input text prompt.
  • ...and 9 more figures
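
The captions of Figures 3 and 4 describe an MLP that maps points on the unit square to per-layer implicit fields with per-shape colors, fitted with an L2 reconstruction loss plus an entropy term. The NumPy sketch below illustrates that idea only and is not the authors' implementation. The network sizes, the back-to-front "over" compositing order, the free-standing `colors` array (the paper predicts colors with the network), and the binary-entropy form used for $L_{entr}$ are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a small MLP; `sizes` lists the layer widths (assumed)."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def occupancy(params, points):
    """Map (N, 2) points in the unit square to (N, K) per-layer occupancies in (0, 1)."""
    h = points
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    logits = h @ W + b
    return 1.0 / (1.0 + np.exp(-logits))

def composite(occ, colors, background):
    """Back-to-front 'over' compositing of K flat-colored layers (assumed order)."""
    img = np.broadcast_to(background, (occ.shape[0], 3)).copy()
    for k in range(occ.shape[1]):
        alpha = occ[:, k:k + 1]
        img = alpha * colors[k] + (1.0 - alpha) * img
    return img

def binary_entropy(occ, eps=1e-7):
    """Mean binary entropy; it is small when occupancies are close to 0 or 1."""
    o = np.clip(occ, eps, 1.0 - eps)
    return float(np.mean(-(o * np.log(o) + (1.0 - o) * np.log(1.0 - o))))

def grid(res):
    """Pixel-center sample points on the unit square at an arbitrary resolution."""
    xs = (np.arange(res) + 0.5) / res
    x, y = np.meshgrid(xs, xs)
    return np.stack([x.ravel(), y.ravel()], axis=1)

def reconstruction_loss(params, colors, points, target, lam=0.01):
    """L2 reconstruction plus a weighted entropy term (lam is an assumed weight)."""
    occ = occupancy(params, points)
    img = composite(occ, colors, np.ones(3))
    return float(np.mean((img - target) ** 2)) + lam * binary_entropy(occ)

rng = np.random.default_rng(0)
params = init_mlp(rng, [2, 32, 32, 4])      # 4 layers (shapes), assumed
colors = rng.uniform(size=(4, 3))           # one RGB color per layer
pts = grid(16)
target = np.ones((pts.shape[0], 3))         # placeholder white target image
loss = reconstruction_loss(params, colors, pts, target)
```

Because the fields are functions of continuous coordinates, the same `params` can be sampled with `grid(64)` or any other resolution, which is the resolution-independence property the caption refers to. Minimizing the entropy term pushes occupancies toward 0 or 1, consistent with the sharper boundaries described for Figure 4.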