CLIPtortionist: Zero-shot Text-driven Deformation for Manufactured 3D Shapes

Xianghao Xu, Srinath Sridhar, Daniel Ritchie

TL;DR

A zero-shot text-driven 3D shape deformation system that deforms an input 3D mesh of a manufactured object to fit an input text description by maximizing an objective function based on the widely used pre-trained vision-language model CLIP.

Abstract

We propose a zero-shot text-driven 3D shape deformation system that deforms an input 3D mesh of a manufactured object to fit an input text description. To do this, our system optimizes the parameters of a deformation model to maximize an objective function based on the widely used pre-trained vision-language model CLIP. We find that CLIP-based objective functions exhibit many spurious local optima; to circumvent them, we parameterize deformations using a novel deformation model called BoxDefGraph, which our system automatically computes from an input mesh. The BoxDefGraph is designed to capture the object-aligned rectangular/circular geometric features of most manufactured objects. We then use the CMA-ES global optimization algorithm to maximize our objective, which we find to work better than popular gradient-based optimizers. We demonstrate that our approach produces appealing results and outperforms several baselines.

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: We present CLIPtortionist, a zero-shot text-driven 3D shape deformation method that can deform an input 3D shape according to a text description while preserving structure and geometry. (Bottom) Our method deforms shapes smoothly from a source shape (middle) to two targets specified using only text.
  • Figure 2: Meshes deformed by different deformation models to maximize a CLIP-based similarity score. From left to right, in decreasing order of degrees of freedom (DoF): Vertex deformer, Cage deformer, BoxDefGraph (Ours), and Single Box deformer. Our BoxDefGraph deformer's output matches the text prompt better than the other deformers.
  • Figure 3: CLIPtortionist takes a 3D triangular mesh $M$ and a text prompt $T$ as input and outputs a deformed target mesh $M_T$ that fits the description in the text prompt. It first generates the deformation model BoxDefGraph $D$ by analyzing the input mesh $M$. The parameters of the BoxDefGraph are then optimized in an iterative loop. Each iteration starts by deforming the mesh according to the current BoxDefGraph parameters $\mathbf{s}_t$. The deformed mesh is rendered into a collection of images $\mathcal{V}$, and each image $v \in \mathcal{V}$ is encoded by a pre-trained CLIP image encoder into a latent code $\mathbf{e}_v$. The input text $T$ is also encoded by a pre-trained CLIP text encoder into a latent code $\mathbf{e}_T$. The system then computes the CLIP loss $\mathcal{L}_{\text{CLIP}}$, the negative average cosine similarity between $\mathbf{e}_T$ and the image codes $\mathbf{e}_v$ for $v \in \mathcal{V}$. CMA-ES updates the parameters, and the updated parameters $\mathbf{s}_{t+1}$ are used to deform the mesh in the next iteration. Once the loop terminates, the optimal parameters are applied to produce the final output deformed mesh $M_T$.
  • Figure 4: Qualitative results comparing Text2Mesh, TextDeformer, and our method.
  • Figure 5: Qualitative results comparing gradient descent optimization and CMA-ES.
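
The CLIP loss described in the Figure 3 caption, the negative average cosine similarity between the text code and the per-view image codes, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `cosine_similarity` and `clip_loss` are hypothetical, the embeddings are assumed to be plain lists of floats already produced by the CLIP encoders, and the rendering and encoding steps are not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_loss(image_embeddings, text_embedding):
    """Negative mean cosine similarity between each view's image
    embedding (e_v) and the text embedding (e_T).

    Hypothetical helper: assumes the embeddings were computed elsewhere
    by a CLIP image encoder and a CLIP text encoder.
    """
    sims = [cosine_similarity(e_v, text_embedding) for e_v in image_embeddings]
    return -sum(sims) / len(sims)


# Toy 2D example: one view aligned with the text code, one orthogonal to it.
views = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
loss = clip_loss(views, text)
```

Because the loss is a negated similarity, minimizing it corresponds to maximizing the CLIP-based objective from the abstract. A black-box optimizer such as CMA-ES needs only this scalar value for each candidate parameter vector, not its gradient.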