DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation

Junkai Yan, Yipeng Gao, Qize Yang, Xihan Wei, Xuansong Xie, Ancong Wu, Wei-Shi Zheng

TL;DR

DreamView tackles the challenge of viewpoint-specific customization in text-to-3D by introducing an adaptive guidance injection mechanism that jointly leverages an overall object description and view-specific texts. It develops DreamView-2D, a diffusion-based model trained on multi-view rendered data, and DreamView-3D, which distills these priors into 3D via score distillation sampling, enabling consistent yet customizable 3D generation. The approach demonstrates strong 2D and 3D performance, outperforming baselines in both customization and consistency, and is supported by extensive ablations, a user study, and demonstrations of reduced prompt burdens through LLM-based prompt generation. Overall, DreamView offers a versatile framework for artist-friendly, view-aware 3D synthesis with practical applicability and extensibility to other text-to-image and 3D pipelines.

Abstract

Text-to-3D generation, which synthesizes 3D assets according to an overall text description, has progressed significantly. However, a challenge arises when specific appearances need to be customized at designated viewpoints while generation relies solely on the overall description. For instance, ambiguity easily occurs when producing a T-shirt with distinct patterns on its front and back from a single overall text description. In this work, we propose DreamView, a text-to-image approach that enables multi-view customization while maintaining overall consistency by adaptively injecting view-specific and overall text guidance through a collaborative text guidance injection module; the approach can also be lifted to 3D generation via score distillation sampling. DreamView is trained on large-scale rendered multi-view images and their corresponding view-specific texts to learn to balance separate content manipulation in each view against the global consistency of the overall object, achieving both customization and consistency. Consequently, DreamView empowers artists to design 3D objects creatively, fostering the creation of more innovative and diverse 3D assets. Code and model will be released at https://github.com/iSEE-Laboratory/DreamView.

Paper Structure

This paper contains 19 sections, 4 equations, 17 figures, and 2 tables.

Figures (17)

  • Figure 1: Text-to-3D generation (RGB images and normal maps) of our DreamView, where users can control what and where to generate by providing an overall text description (surrounded by a black box in the figure) and view-specific texts, thus achieving customizable 3D generation. The subject, object, and the position of the object in the overall text are marked in red, blue, and green, respectively. For the two generated results on the right, we only show the overall text.
  • Figure 2: Text-to-3D generation (front and back views) of recent works, which cannot generate content strictly aligned with the texts and may also suffer from inconsistency across views.
  • Figure 3: The overall framework of DreamView-2D. Left: the data preparation pipeline and the data flow of DreamView-2D. The raw 3D objects from the Objaverse dataset are first rendered to multi-view images and captioned by BLIP-2. Finally, the view-specific texts are merged by GPT-4 to form the overall text. With this image-text paired data, DreamView-2D, augmented by an adaptive guidance injection module, is trained to learn a trade-off between 3D consistency and customization. Right: the detail of the adaptive guidance injection module working in each U-Net block of the model, which measures the similarity between the image embedding and the two types of text embedding to determine which text guidance should be used in the current U-Net block, thus achieving an adaptive balance between consistency and customization. A margin hyper-parameter is used in the model to control the balance.
  • Figure 4: The overall framework of DreamView-3D, which optimizes a 3D representation via score distillation sampling (as introduced in DreamFusion) supervised by DreamView-2D, thus inheriting the consistent and customizable priors.
  • Figure 5: Left: qualitative text-to-image generation results of DreamView-2D with different margins. Right: quantitative evaluation results (CLIP image-text score) on the validation set as the margin changes. As the margin gradually increases, customization weakens while consistency increases.
  • ...and 12 more figures
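
The per-block choice between the two text embeddings described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of cosine similarity, the hard either/or switch, and the direction of the margin comparison are all assumptions. They were chosen only so that a larger margin selects the overall text more often, which matches the trend stated in the Figure 5 caption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_guidance(
    image_emb: Sequence[float],
    view_text_emb: Sequence[float],
    overall_text_emb: Sequence[float],
    margin: float,
) -> str:
    """Pick which text guidance a (hypothetical) U-Net block would use.

    Assumed rule: use the view-specific text only when the image embedding
    is closer to it than to the overall text by more than `margin`;
    otherwise fall back to the overall text.
    """
    sim_view = cosine_similarity(image_emb, view_text_emb)
    sim_overall = cosine_similarity(image_emb, overall_text_emb)
    return "view" if sim_view - sim_overall > margin else "overall"


if __name__ == "__main__":
    # Toy 2-D embeddings; real embeddings would come from the model.
    image = [1.0, 0.0]
    view_text = [1.0, 0.0]
    overall_text = [0.0, 1.0]
    for m in (0.5, 1.5):
        print(m, select_guidance(image, view_text, overall_text, m))
```

Under this assumed rule, raising the margin can only move blocks from view-specific to overall guidance, never the reverse, so the margin acts as a single knob trading customization for consistency.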