Retrieval-Augmented Score Distillation for Text-to-3D Generation

Junyoung Seo, Susung Hong, Wooseok Jang, Inès Hyeonsu Kim, Minseop Kwak, Doyup Lee, Seungryong Kim

TL;DR

This work tackles 3D geometry inconsistencies in text-to-3D generation stemming from limited high-quality 3D data. It introduces ReDream, a retrieval-augmented score distillation framework that uses semantically aligned 3D assets as geometric priors and applies lightweight adaptation to the 2D diffusion prior, improving view consistency and texture fidelity. The method formalizes a retrieval-informed variational objective and employs a Wasserstein gradient flow to update particle representations, achieving better geometry and texture without full retuning of large diffusion models. Empirical results on Objaverse-based retrieval show improved CLIP alignment and view-consistency (lower A-LPIPS), with strong human preferences and efficient test-time retrieval and adaptation. Overall, ReDream provides a practical path to robust, controllable 3D content generation by combining retrieval guidance with minimal 2D prior modification.

Abstract

Text-to-3D generation has achieved significant success by incorporating powerful 2D diffusion models, but insufficient 3D prior knowledge also leads to inconsistent 3D geometry. Since large-scale multi-view datasets have recently been released, fine-tuning the diffusion model on these datasets has become a mainstream way to address the 3D inconsistency problem. However, this approach faces fundamental difficulties stemming from the limited quality and diversity of 3D data compared with 2D data. To sidestep these trade-offs, we explore a retrieval-augmented approach tailored for score distillation, dubbed ReDream. We postulate that both the expressiveness of 2D diffusion models and the geometric consistency of 3D assets can be fully leveraged by employing semantically relevant assets directly within the optimization process. To this end, we introduce a novel framework for retrieval-based quality enhancement in text-to-3D generation. We leverage the retrieved asset to incorporate its geometric prior in the variational objective and adapt the diffusion model's 2D prior toward view consistency, achieving drastic improvements in both geometry and fidelity of generated scenes. We conduct extensive experiments to demonstrate that ReDream exhibits superior quality with increased geometric consistency. The project page is available at https://ku-cvlab.github.io/ReDream/.

Paper Structure (44 sections, 20 equations, 19 figures, 3 tables)


Figures (19)

  • Figure 1: Our framework enables high-quality generation of 3D content by leveraging retrieved assets from external databases, achieving significantly more robust geometric consistency, as demonstrated in (a), as well as enhanced detail and fidelity, as shown in (b), without being bounded by the textural quality of the 3D assets.
  • Figure 2: Overview. Given a prompt $c$, we retrieve the nearest neighboring assets from the 3D database. With these assets, we initialize a variational distribution to incorporate a robust 3D geometric prior, and we perform lightweight adaptation of the 2D prior model to equalize probability density across viewpoints.
  • Figure 3: Generated results and corresponding nearest asset. The first row shows the first nearest neighbor from the retrieved assets, with the renderings of corresponding particles from the given texts displayed below.
  • Figure 4: Lightweight adaptation of 2D diffusion models. We compare the effectiveness of the adaptation, given a rendering from a 3D asset in (a). We linearly interpolate a text embedding from "a back view of an angry cat" to "a front view of an angry cat" through "side view". (b) 2D samples from the prior model. (c) 2D samples from the adapted prior model with learned view prefixes. Compared with (b), the samples from the adapted 2D prior in (c) reflect a variety of viewpoints and are not biased toward a single viewpoint.
  • Figure 5: 3D dataset retrieval. (a) and (b) show the retrieved top-$K$ nearest neighbors in the CLIP text embedding space and the CLIP image embedding space, respectively.
  • ...and 14 more figures
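
The top-$K$ nearest-neighbor retrieval mentioned in the Figure 2 and Figure 5 captions can be illustrated with a minimal sketch. It assumes that each 3D asset and the prompt have already been mapped to embedding vectors (for example by a CLIP encoder) and that similarity is measured by cosine similarity. The function names, the toy three-dimensional vectors, and the asset ids are hypothetical and are not taken from the paper or its code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, asset_embeddings, k):
    """Return the ids of the k assets most similar to the query.

    asset_embeddings maps a (hypothetical) asset id to its embedding.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), asset_id)
        for asset_id, embedding in asset_embeddings.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [asset_id for _, asset_id in scored[:k]]


# Toy database: real embeddings would come from an encoder and be
# much higher-dimensional; these values are made up for illustration.
toy_db = {
    "cat_mesh": [0.9, 0.1, 0.0],
    "dog_mesh": [0.7, 0.3, 0.1],
    "car_mesh": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # stand-in for the embedded prompt c
top2 = retrieve_top_k(query, toy_db, k=2)
print(top2)
```

A brute-force scan like this is only practical for small databases; at the scale of a large 3D asset collection, an approximate nearest-neighbor index would typically replace the linear search.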