TextField3D: Towards Enhancing Open-Vocabulary 3D Generation with Noisy Text Fields

Tianyu Huang, Yihan Zeng, Bowen Dong, Hang Xu, Songcen Xu, Rynson W. H. Lau, Wangmeng Zuo

TL;DR

This work proposes to inject dynamic noise into the latent space of given text prompts, i.e., Noisy Text Fields (NTFs), so that limited 3D data can be mapped to the appropriate range of textual latent space that is expanded by NTFs.

Abstract

Recent works learn 3D representations explicitly under text-3D guidance. However, limited text-3D data restricts the vocabulary scale and text control of generations. Generators may easily collapse to a stereotyped concept for certain text prompts, thus losing open-vocabulary generation ability. To tackle this issue, we introduce a conditional 3D generative model, namely TextField3D. Specifically, rather than using the text prompts as input directly, we propose injecting dynamic noise into the latent space of given text prompts, i.e., Noisy Text Fields (NTFs). In this way, limited 3D data can be mapped to the appropriate range of the textual latent space that is expanded by NTFs. To this end, an NTFGen module is proposed to model general text latent codes in noisy fields. Meanwhile, an NTFBind module is proposed to align view-invariant image latent codes to noisy fields, further supporting image-conditional 3D generation. To guide the conditional generation in both geometry and texture, multi-modal discrimination is constructed with a text-3D discriminator and a text-2.5D discriminator. Compared to previous methods, TextField3D offers three merits: 1) large vocabulary, 2) text consistency, and 3) low latency. Extensive experiments demonstrate that our method has the potential for open-vocabulary 3D generation.
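
The core idea of a Noisy Text Field, perturbing a text latent code so that one 3D sample covers a neighbourhood of the textual latent space rather than a single point, can be illustrated with a toy sketch. This is not the paper's NTFGen module: the function name `noisy_text_field`, the `max_sigma` parameter, the uniform sampling of the noise level, and the final re-normalization are all assumptions made for illustration, and the input vector stands in for the output of a real text encoder.

```python
import math
import random


def noisy_text_field(text_latent, max_sigma=0.1, rng=None):
    """Return a noisy, unit-length copy of a text latent code.

    Hypothetical illustration only: the noise level is resampled on
    every call (the "dynamic" part), Gaussian noise is added to each
    dimension, and the result is rescaled to unit length.
    """
    if rng is None:
        rng = random.Random()
    # Assumption: draw a fresh noise level in [0, max_sigma] per call.
    sigma = rng.uniform(0.0, max_sigma)
    noisy = [x + rng.gauss(0.0, sigma) for x in text_latent]
    # Assumption: keep the code on the unit sphere, as is common for
    # normalized text embeddings; fall back to 1.0 for a zero vector.
    norm = math.sqrt(sum(v * v for v in noisy)) or 1.0
    return [v / norm for v in noisy]


# Toy 3-dimensional "text latent"; a real encoder would produce a
# much higher-dimensional vector.
latent = [0.6, 0.8, 0.0]
samples = [noisy_text_field(latent, rng=random.Random(seed)) for seed in range(3)]
```

Each call yields a slightly different latent code near the original, so a generator conditioned on these samples would see a region around the prompt during training instead of one fixed vector.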

Paper Structure (29 sections, 7 equations, 16 figures, 7 tables)

Figures (16)

  • Figure 1: TextField3D is capable of text-to-3D and image-to-3D tasks. All the text prompts are taken from [nichol2022point, jun2023shap]. Our method generates diverse and reasonable results, exhibiting a potential open-vocabulary capability.
  • Figure 2: The overall framework of TextField3D. The NTFGen module and the NTFBind module are proposed to encode conditional latent codes in noisy text fields, which are fed to a 3D generator. Multi-modal discrimination is then exploited to supervise the generation process, including a text-3D discriminator and a text-2.5D discriminator.
  • Figure 3: The motivation of the Noisy Text Field. Previous works may easily form stereotyped concepts from the training data but fail on out-of-distribution prompts. In contrast, the Noisy Text Field can model the semantic correspondences among cases.
  • Figure 4: Text-to-3D generation. We use seven representative text prompts to exhibit open-vocabulary generative capability, including single nouns, compound nouns, and adjectival nouns.
  • Figure 5: Image-to-3D generation. We use a clay pot in the bottom view and a donkey in the front view as image inputs. Our network can reproduce a novel view close to the ground-truth image.
  • ...and 11 more figures