AWOL: Analysis WithOut synthesis using Language

Silvia Zuffi, Michael J. Black

TL;DR

This paper addresses the challenge of generating novel 3D assets from natural language by learning a mapping from the CLIP latent space to parametric shape spaces. The approach, AWOL, combines a differentiable animal model (SMAL+) and a non-differentiable tree generator (Tree-Gen) with a Real-NVP-based Text-to-Shape model that conditions shape parameters on language or image prompts. It demonstrates that language-driven control enables interpolation and generalization to unseen species and ages, including state-of-the-art 3D dog shape estimation and the first language-driven generation of 3D trees. The work advances practical 3D asset creation by enabling text- and image-driven generation of rigged animals and detailed trees, with broad implications for animation, robotics, and biological visualization.
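To make the idea of a Real-NVP-based model that conditions shape parameters on a prompt more concrete, the following is a minimal sketch of a single conditional affine coupling layer. It is not the authors' implementation: the class name `ConditionalCoupling`, the single linear layers standing in for trained networks, the random weights, and the dimensions (6 shape parameters, a 4-dimensional condition vector in place of a CLIP embedding) are all assumptions chosen for illustration.

```python
import math
import random


def make_linear(n_in, n_out, rng, scale=0.1):
    """Random (n_out x n_in) weights and zero bias; a stand-in for a trained network."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(params, x):
    """Apply y = W x + b using plain Python lists."""
    weights, bias = params
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class ConditionalCoupling:
    """One affine coupling layer whose scale and shift depend on a condition vector."""

    def __init__(self, dim, cond_dim, rng):
        self.d = dim // 2  # the first half of the input passes through unchanged
        n_in = self.d + cond_dim
        n_out = dim - self.d
        self.scale_net = make_linear(n_in, n_out, rng)
        self.shift_net = make_linear(n_in, n_out, rng)

    def _scale_shift(self, x_a, cond):
        h = x_a + cond  # list concatenation: untouched half plus condition
        s = [math.tanh(v) for v in linear(self.scale_net, h)]  # bounded log-scale
        t = linear(self.shift_net, h)
        return s, t

    def forward(self, x, cond):
        """Map x to y; also return the log-determinant of the Jacobian."""
        x_a, x_b = x[:self.d], x[self.d:]
        s, t = self._scale_shift(x_a, cond)
        y_b = [xb * math.exp(si) + ti for xb, si, ti in zip(x_b, s, t)]
        return x_a + y_b, sum(s)

    def inverse(self, y, cond):
        """Exactly undo forward() for the same condition vector."""
        y_a, y_b = y[:self.d], y[self.d:]
        s, t = self._scale_shift(y_a, cond)
        x_b = [(yb - ti) * math.exp(-si) for yb, si, ti in zip(y_b, s, t)]
        return y_a + x_b


rng = random.Random(0)
layer = ConditionalCoupling(dim=6, cond_dim=4, rng=rng)
shape_params = [rng.gauss(0.0, 1.0) for _ in range(6)]  # placeholder shape parameters
prompt_embedding = [rng.gauss(0.0, 1.0) for _ in range(4)]  # placeholder for a CLIP embedding

latent, log_det = layer.forward(shape_params, prompt_embedding)
recovered = layer.inverse(latent, prompt_embedding)
```

Because the scale and shift are computed only from the untouched half and the condition, the layer can be inverted in closed form. This property is what lets a normalizing flow be trained in one direction and sampled in the other. A full model would stack several such layers and alternate which half is transformed.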

Abstract

Many classical parametric 3D shape models exist, but creating novel shapes with such models requires expert knowledge of their parameters. For example, imagine creating a specific type of tree using procedural graphics or a new kind of animal from a statistical shape model. Our key idea is to leverage language to control such existing models to produce novel shapes. This involves learning a mapping between the latent space of a vision-language model and the parameter space of the 3D model, which we do using a small set of shape and text pairs. Our hypothesis is that mapping from language to parameters allows us to generate parameters for objects that were never seen during training. If the mapping between language and parameters is sufficiently smooth, then interpolation or generalization in language should translate appropriately into novel 3D shapes. We test our approach with two very different types of parametric shape models (quadrupeds and arboreal trees). We use a learned statistical shape model of quadrupeds and show that we can use text to generate new animals not present during training. In particular, we demonstrate state-of-the-art shape estimation of 3D dogs. This work also constitutes the first language-driven method for generating 3D trees. Finally, embedding images in the CLIP latent space enables us to generate animals and trees directly from images.
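The smoothness hypothesis can be illustrated with a toy sketch: blend two prompt embeddings, push each blend through a smooth mapping, and observe that the resulting parameters change gradually. Everything here is hypothetical. The 3-dimensional vectors are placeholders for CLIP embeddings, and `toy_mapping` is an arbitrary smooth function, not the learned mapping described in the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def interpolate(e0, e1, alpha):
    """Linear blend of two unit embeddings, renormalized to unit length."""
    return normalize([(1 - alpha) * a + alpha * b for a, b in zip(e0, e1)])


def toy_mapping(embedding):
    """Hypothetical smooth map from a 3-D embedding to 2 shape parameters."""
    x, y, z = embedding
    return [math.tanh(x + 0.5 * y), math.tanh(z - 0.5 * y)]


emb_a = normalize([1.0, 0.2, 0.0])  # placeholder for the embedding of one prompt
emb_b = normalize([0.0, 0.3, 1.0])  # placeholder for the embedding of another prompt

# Shape parameters along the path from prompt A to prompt B.
path = [toy_mapping(interpolate(emb_a, emb_b, k / 4)) for k in range(5)]

# Largest parameter change between neighbouring steps; small values indicate a smooth path.
max_step = max(
    abs(p - q)
    for prev, cur in zip(path, path[1:])
    for p, q in zip(prev, cur)
)
```

In this toy setting the endpoints of `path` reproduce the parameters of the two original prompts, and the intermediate entries are the "novel" parameter vectors. Whether a real learned mapping behaves this smoothly is exactly what the abstract frames as a hypothesis to test.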

Paper Structure (14 sections, 3 equations, 13 figures, 1 table)

Figures (13)

  • Figure 1: Generated trees and animals. AWOL learns to generate animals and trees from text and images. We show examples for text-generated tree and animal species not seen during training (except for the cat).
  • Figure 2: Training set for the tree network. From left: Poplar, Maple, Palm, Silver Birch, English Oak, European Larch, Weeping Willow, Balsam Fir, Black Tupelo, Sphere Tree, Black Oak, Hill Cherry, Sassafras, Douglas Fir, Apple, Willow, Cypress, Magnolia, Pine, Fan Palm, Quaking Aspen.
  • Figure 3: Network architecture. At training, we can consider only text as input (a), or also provide reference images (b), with about 3 to 10 examples for each breed/species. At inference, we can query the text-only network with text (c), or the text-and-image network with images (d).
  • Figure 4: Dog breeds. We verify that CLIP can discriminate the dog breeds in the D-SMAL training set by running a zero-shot classification test on the images above, which achieved $100\%$ accuracy.
  • Figure 5: Horse breeds. We found with a zero-shot classification test that, among the horse breeds above, CLIP can correctly recognize only the Tinker/Shire horses (violet box) and the Icelandic/Welsh ponies (blue box).
  • ...and 8 more figures
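
The zero-shot classification tests mentioned in the captions for Figures 4 and 5 can be sketched as a nearest-label lookup by cosine similarity. This is an illustration under stated assumptions, not the authors' evaluation code: the labels `breed_a`, `breed_b`, `breed_c` and all embedding values are made up, and in practice the vectors would come from a vision-language model's text and image encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, text_embs):
    """Return the label whose text embedding is most similar to the image embedding."""
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))


def accuracy(samples, text_embs):
    """Fraction of (image embedding, true label) pairs classified correctly."""
    hits = sum(zero_shot_classify(emb, text_embs) == label for emb, label in samples)
    return hits / len(samples)


# Hypothetical text embeddings, one per class prompt.
text_embs = {
    "breed_a": [1.0, 0.1, 0.0],
    "breed_b": [0.0, 1.0, 0.2],
    "breed_c": [0.1, 0.0, 1.0],
}

# Hypothetical image embeddings paired with their true labels.
samples = [
    ([0.9, 0.2, 0.1], "breed_a"),
    ([0.1, 0.8, 0.3], "breed_b"),
    ([0.2, 0.1, 0.7], "breed_c"),
]

score = accuracy(samples, text_embs)
```

A test like this indicates whether the embedding space separates the classes at all. If two breeds are confused here, a mapping conditioned on those embeddings has little signal with which to tell them apart.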