LucidDreaming: Controllable Object-Centric 3D Generation

Zhaoning Wang, Ming Li, Chen Chen

TL;DR

LucidDreaming tackles precise, multi-object 3D generation from text with a plug-and-play pipeline: an LLM derives 3D bounding boxes from the prompt, and clipped ray sampling (CRS) combined with an object-centric density bias (OCDB) renders per-object content within NeRF-based frameworks. The method is compatible with multiple SDS-based 3D generation backbones and can insert new objects into pre-trained NeRF scenes, addressing the difficulty prior approaches have with numeracy and spatial relations. The paper also contributes a dataset of prompts with 3D bounding boxes and evaluations (BLIP-VQA, GPT-4V, human ratings) that show improved object placement precision and generation fidelity. Ablation studies verify the contributions of CRS, OCDB, and the scene-preservation components. The work makes high-fidelity 3D asset generation more accessible to non-experts and provides benchmarking resources for controllable 3D generation.

Abstract

With the recent development of generative models, Text-to-3D generation has also seen significant growth, opening the door to video-game 3D asset creation by a more general public. Nonetheless, people without professional 3D editing experience find it hard to achieve precise control over 3D generation, especially when the prompt contains multiple objects, as text-only control often leads to missing objects and imprecise locations. In this paper, we present LucidDreaming, an effective pipeline capable of spatial and numerical control over 3D generation from only textual prompt commands or 3D bounding boxes. Specifically, our research demonstrates that Large Language Models (LLMs) possess 3D spatial awareness and can effectively translate textual 3D information into precise 3D bounding boxes. We leverage LLMs to obtain individual object information and the corresponding 3D bounding boxes as the initial step of our process. Then, given the bounding boxes, we further propose clipped ray sampling and object-centric density blob bias to generate 3D objects that align with the bounding boxes. We show that our method adapts to a spectrum of mainstream Score Distillation Sampling-based 3D generation frameworks, and that our pipeline can even be used to insert objects into an existing NeRF scene. Moreover, we provide a dataset of prompts with 3D bounding boxes for benchmarking 3D spatial controllability. With extensive qualitative and quantitative experiments, we demonstrate that LucidDreaming achieves superior results in object placement precision and generation fidelity compared to current approaches, while maintaining flexibility and ease of use for non-expert users.
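
To make the idea of an object-centric density blob bias more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the bias is a Gaussian blob (a form often used to initialize density in SDS pipelines) and places one blob at the center of each bounding box instead of a single blob at the scene origin. The function name `object_centric_density_bias`, the `scale` parameter, and the choice of a quarter of the box size as the standard deviation are all hypothetical.

```python
import numpy as np

def object_centric_density_bias(points, boxes, scale=10.0):
    """Return an additive density bias for each 3D point.

    points: (N, 3) array of sample positions.
    boxes:  list of (box_min, box_max) axis-aligned bounding boxes.
    Assumption: one Gaussian blob per box, centered on the box, with a
    per-axis standard deviation equal to a quarter of the box size.
    """
    points = np.asarray(points, dtype=float)
    bias = np.zeros(len(points))
    for box_min, box_max in boxes:
        box_min = np.asarray(box_min, dtype=float)
        box_max = np.asarray(box_max, dtype=float)
        center = 0.5 * (box_min + box_max)
        std = 0.25 * (box_max - box_min)
        # Squared distance to the box center, normalized per axis.
        d2 = np.sum(((points - center) / std) ** 2, axis=-1)
        # Take the strongest blob at each point so boxes do not add up.
        bias = np.maximum(bias, scale * np.exp(-0.5 * d2))
    return bias

# Two boxes placed left and right of the scene origin.
boxes = [((-1.0, -0.25, -0.25), (-0.5, 0.25, 0.25)),
         ((0.5, -0.25, -0.25), (1.0, 0.25, 0.25))]
queries = np.array([[-0.75, 0.0, 0.0],   # center of the left box
                    [0.75, 0.0, 0.0],    # center of the right box
                    [0.0, 0.0, 0.0]])    # scene origin, outside both boxes
bias = object_centric_density_bias(queries, boxes)
```

In this sketch the bias peaks at each box center and is close to zero at the scene origin, so the initial density sits where the objects are meant to appear rather than in the middle of the scene.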

Paper Structure (47 sections, 10 equations, 22 figures, 4 tables)

Figures (22)

  • Figure 1: Our pipeline enables numerical and spatial control with just textual information, whereas the baseline methods [poole2022dreamfusion, lin2023magic3d, wang2023prolificdreamer] alone often fail to fully capture the control logic (top). At the bottom, we show an application of our pipeline: generating and placing specified objects in a NeRF from user-provided bounding boxes and prompts. This is typically difficult due to NeRF's implicit representation.
  • Figure 2: A high-level overview of our pipeline. Controlling prompts are decomposed into 3D bounding boxes with LLMs such as GPT-4. Then, in LucidDreaming, object-centric density bias and clipped ray sampling are used with the Score Distillation Sampling (SDS) loss to align the generation with the user’s control.
  • Figure 3: Clipped Ray Sampling. Points within object boxes are sampled between $t_{\text{entry}}$ and $t_{\text{exit}}$ for $\mathcal{L}_{\text{SDS}}$. Outside points use $\mathcal{L}_{\text{rec}}$ against the frozen NeRF.
  • Figure 4: Two toy examples illustrating the occupancy grids under clipped ray sampling. With the default uni-sphere density bias (a), the objects are either clustered at the center (top) or missing entirely due to vanishing gradients (bottom), while our object-centric bias (b) aligns each object's initial density with its given bounding box.
  • Figure 5: Examples of controlled 3D generation. The bounding boxes and prompts are decomposed from the scene prompt with an LLM. Our method adapts to multiple SDS-based 3D generation methods to generate bounding-box-controlled 3D content. The last row shows the best of the three baseline methods given the scene prompt. Clustered objects, missing items, and wrong spatial relations are the most common issues in the baseline methods. Please refer to the supplementary material for more results and for further frameworks adapted to our pipeline.
  • ...and 17 more figures
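
The Figure 3 caption describes sampling points between $t_{\text{entry}}$ and $t_{\text{exit}}$, the distances at which a ray enters and leaves an object's bounding box. The sketch below illustrates only that geometric step, assuming axis-aligned boxes and the standard slab intersection test. The function names `ray_box_interval` and `sample_clipped` are hypothetical, the uniform midpoint sampling is an assumption, and the loss terms ($\mathcal{L}_{\text{SDS}}$ inside, $\mathcal{L}_{\text{rec}}$ outside) are not modeled here.

```python
import numpy as np

def ray_box_interval(origin, direction, box_min, box_max, eps=1e-9):
    """Slab test: return (t_entry, t_exit) for a ray against an
    axis-aligned box, or None if the ray misses the box."""
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    # Replace near-zero components to avoid division by zero.
    safe_dir = np.where(np.abs(direction) < eps, eps, direction)
    t1 = (np.asarray(box_min, dtype=float) - origin) / safe_dir
    t2 = (np.asarray(box_max, dtype=float) - origin) / safe_dir
    t_entry = np.max(np.minimum(t1, t2))  # last slab entered
    t_exit = np.min(np.maximum(t1, t2))   # first slab exited
    t_entry = max(t_entry, 0.0)           # ignore the part behind the camera
    if t_exit <= t_entry:
        return None
    return float(t_entry), float(t_exit)

def sample_clipped(origin, direction, box_min, box_max, n_samples=8):
    """Return (n_samples, 3) points on the ray, restricted to the box."""
    interval = ray_box_interval(origin, direction, box_min, box_max)
    if interval is None:
        return np.empty((0, 3))
    t_entry, t_exit = interval
    # Midpoints of n_samples uniform bins between entry and exit.
    t = t_entry + (np.arange(n_samples) + 0.5) / n_samples * (t_exit - t_entry)
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    return origin + t[:, None] * direction

# A ray travelling along +z through a unit box centered at the origin.
points = sample_clipped([0.0, 0.0, -2.0], [0.0, 0.0, 1.0],
                        [-0.5, -0.5, -0.5], [0.5, 0.5, 0.5], n_samples=4)
```

Every returned point lies inside the box, so a renderer built on this sketch would accumulate density only from the region assigned to that object.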