
ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation

Zhengzhe Liu, Peng Dai, Ruihui Li, Xiaojuan Qi, Chi-Wing Fu

TL;DR

ISS introduces a two-stage feature-space alignment that uses 2D images as stepping stones to connect CLIP-based text and image features with a pre-trained SVR shape space, enabling text-driven 3D shape generation without paired text-shape data. Stage-1 trains a CLIP2Shape mapper to map image features to the SVR shape space; Stage-2 fine-tunes this mapper at test time using CLIP consistency between the input text and rendered views to better align text with the generated shape. A text-guided stylization module enriches outputs with novel textures and structures, extending beyond the SVR priors while remaining compatible with multiple SVR models (DVR, SS3D, GET3D, IM-Net). Experiments on ShapeNet and CO3D show ISS outperforms state-of-the-art CLIP-based baselines in fidelity and text-shape consistency, with fast inference (~85 seconds) and capabilities for diversified and stylized shapes. The approach broadens text-to-3D generation to a wider range of categories and real-world data by leveraging 2D supervision and CLIP's joint text-image embeddings.

Abstract

Text-guided 3D shape generation remains challenging due to the absence of large paired text-shape datasets, the substantial semantic gap between these two modalities, and the structural complexity of 3D shapes. This paper presents a new framework called Image as Stepping Stone (ISS) for the task, which introduces 2D images as a stepping stone to connect the two modalities and to eliminate the need for paired text-shape data. Our key contribution is a two-stage feature-space-alignment approach that maps CLIP features to shapes by harnessing a pre-trained single-view reconstruction (SVR) model with multi-view supervision: we first map the CLIP image feature to the detail-rich shape space in the SVR model, then map the CLIP text feature to the shape space and optimize the mapping by encouraging CLIP consistency between the input text and the rendered images. Further, we formulate a text-guided shape stylization module to dress up the output shapes with novel textures. Beyond existing works on 3D shape generation from text, our new approach is general for creating shapes in a broad range of categories, without requiring paired text-shape data. Experimental results show that our approach outperforms state-of-the-art methods and our baselines in terms of fidelity and consistency with text. Further, our approach can stylize the generated shapes with both realistic and fantasy structures and textures.
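
To make the "CLIP consistency" objective of the second alignment stage more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the consistency is measured as cosine similarity between the CLIP text feature and the CLIP image features of the rendered views, and that the loss is one minus the mean similarity; the exact loss form, the function names (`cosine`, `clip_consistency_loss`), and the three-dimensional stand-in vectors are all hypothetical. In practice the features would come from the CLIP text and image encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_consistency_loss(text_feat, view_feats):
    """Assumed loss form: 1 - mean cosine similarity over rendered views."""
    sims = [cosine(text_feat, f) for f in view_feats]
    return 1.0 - sum(sims) / len(sims)


# Stand-in vectors (hypothetical); real features would be CLIP embeddings
# of the input text and of images rendered from the generated shape.
text_feat = [1.0, 0.0, 0.0]
aligned_views = [[2.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
misaligned_views = [[0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]

loss_aligned = clip_consistency_loss(text_feat, aligned_views)
loss_misaligned = clip_consistency_loss(text_feat, misaligned_views)
```

Under this assumed form, views whose features point in the same direction as the text feature yield a lower loss than views whose features are orthogonal to it, which is the signal a test-time fine-tuning step would minimize.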

Paper Structure

This paper contains 67 sections, 5 equations, 29 figures, 8 tables.

Figures (29)

  • Figure 1: Our novel "Image as Stepping Stone" framework (a) is able to connect the text space (the CLIP Text feature) and the 3D shape space (the SVR feature) through our two-stage feature-space alignment, such that we can generate plausible 3D shapes from text (b) beyond the capabilities of the existing works (CLIP-Forge and Dream Fields), without requiring paired text-shape data.
  • Figure 2: Overview of our text-guided 3D shape generation framework, which has three major stages. (a) Leveraging a pre-trained SVR model, in stage-1 feature-space alignment, we train the CLIP2Shape mapper $M$ to map the CLIP image feature space $\Omega_{\text{I}}$ to the shape space $\Omega_{\text{S}}$ of the SVR model with $E_\text{S}$ and $E_\text{I}$ frozen, and fine-tune the decoder $D$ with an additional background loss $L_{\text{bg}}$. $M$ and $D$ are trained simultaneously but with separate losses: the gradients from the SVR loss $L_D$ and the background loss $L_{\text{bg}}$ are stopped from propagating to $M$. (b) In stage-2 feature-space alignment, we fix $D$ and fine-tune $M$ into $M'$ by encouraging CLIP consistency between the input text $T$ and the rendered images at test time. (c) Last, we optimize the style of the generated shape and texture of $S$ for $T$. At inference, we use stage 2 to generate a 3D shape from $T$; (c) is optional.
  • Figure 3: Empirical studies on the CLIP feature space for text-guided 3D shape generation.
  • Figure 4: Effect of generating shapes from the same text with/without background loss $L_\text{bg}$.
  • Figure 5: Qualitative comparisons with existing works and baselines.
  • ...and 24 more figures
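
The stage-1 training described in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' code: it assumes the mapper $M$ is a plain linear map trained by L2 regression from a frozen CLIP image feature to a frozen SVR shape feature, and all names (`apply_mapper`, `stage1_loss`, `stage1_step`) and the two-dimensional stand-in features are hypothetical. Because the mapper is updated only from its own regression loss, the sketch mirrors the caption's point that gradients from $L_D$ and $L_{\text{bg}}$ do not reach $M$.

```python
def apply_mapper(W, x):
    """Apply a linear mapper W (list of rows) to a feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def stage1_loss(W, pairs):
    """Mean squared L2 distance between mapped image features and shape features."""
    total = 0.0
    for f_img, f_shape in pairs:
        pred = apply_mapper(W, f_img)
        total += sum((p - t) ** 2 for p, t in zip(pred, f_shape))
    return total / len(pairs)


def stage1_step(W, pairs, lr=0.1):
    """One gradient-descent step on the mapper only; encoder outputs stay fixed."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for f_img, f_shape in pairs:
        pred = apply_mapper(W, f_img)
        for i, (p, t) in enumerate(zip(pred, f_shape)):
            for j, xj in enumerate(f_img):
                grad[i][j] += 2.0 * (p - t) * xj / len(pairs)
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, grad)]


# Hypothetical (image feature, shape feature) pairs standing in for the
# outputs of the frozen encoders E_I and E_S on the same training images.
pairs = [
    ([1.0, 0.0], [0.5, -1.0]),
    ([0.0, 1.0], [2.0, 0.0]),
    ([1.0, 1.0], [2.5, -1.0]),
]

W = [[0.0, 0.0], [0.0, 0.0]]
initial_loss = stage1_loss(W, pairs)
for _ in range(200):
    W = stage1_step(W, pairs)
final_loss = stage1_loss(W, pairs)
```

In a real setup the mapper would likely be a small neural network trained with an autograd framework, where the separation between the mapper's loss and the decoder's losses is enforced by detaching the mapped feature before it enters the decoder.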