Progressive3D: Progressively Local Editing for Text-to-3D Content Creation with Complex Semantic Prompts

Xinhua Cheng, Tianyu Yang, Jianan Wang, Yu Li, Lei Zhang, Jian Zhang, Li Yuan

TL;DR

This work tackles semantic misalignment in text-to-3D generation when prompts describe multiple interacting objects with diverse attributes. It introduces Progressive3D, a framework that decomposes complex generation into sequential, region-constrained local edits, guided by 2D masks derived from user region prompts and reinforced by content-consistency and content-initialization constraints. A key contribution is Overlapped Semantic Component Suppression (OSCS), which isolates semantic differences between source and target prompts to minimize attribute leakage and improve edit fidelity. Evaluations on CSP-100 across NeRF, SDF, and DMTet-based pipelines show substantial gains in fine-grained semantic alignment and editing effectiveness, with ablations validating each component. The approach is general to multiple 3D representations and diffusion backbones, offering a practical pathway to precise, complex-prompt text-to-3D content creation.

Abstract

Recent text-to-3D generation methods achieve impressive 3D content creation capacity thanks to advances in image diffusion models and optimization strategies. However, current methods struggle to generate correct 3D content for semantically complex prompts, i.e., prompts describing multiple interacting objects bound to different attributes. In this work, we propose a general framework named Progressive3D, which decomposes the entire generation into a series of locally progressive editing steps to create precise 3D content for complex prompts. In each editing step, we constrain content changes to occur only in regions determined by user-defined region prompts. Furthermore, we propose an overlapped semantic component suppression technique to encourage the optimization process to focus more on the semantic differences between prompts. Extensive experiments demonstrate that the proposed Progressive3D framework generates precise 3D content for prompts with complex semantics and generalizes to various text-to-3D methods driven by different 3D representations.
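To make the idea of suppressing overlapped semantics concrete, here is a minimal sketch of one way it could be realized: project the target prompt's guidance direction onto the source prompt's guidance direction, shrink the parallel (overlapped) part, and keep the remainder at full strength. The function name `suppress_overlap`, the flattened-list inputs, and the `weight` parameter are assumptions for illustration; the abstract does not specify the exact formulation.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def suppress_overlap(delta_src: Vector, delta_tgt: Vector, weight: float = 4.0) -> Vector:
    """Down-weight the part of delta_tgt that is parallel to delta_src.

    delta_src, delta_tgt: hypothetical guidance directions for the source
    and target prompts, flattened to 1-D lists (an assumption of this sketch).
    weight: hypothetical suppression factor; weight > 1 shrinks the
    overlapped component, weight == 1 leaves delta_tgt unchanged.
    """
    src_norm_sq = dot(delta_src, delta_src)
    if src_norm_sq == 0.0:
        # No source direction to project onto: nothing to suppress.
        return list(delta_tgt)
    scale = dot(delta_tgt, delta_src) / src_norm_sq
    overlap = [scale * s for s in delta_src]                   # shared semantics
    difference = [t - o for t, o in zip(delta_tgt, overlap)]   # new semantics
    return [o / weight + d for o, d in zip(overlap, difference)]


# The component along the source direction (2.0) is divided by 4;
# the orthogonal component (3.0) is kept as-is.
print(suppress_overlap([1.0, 0.0], [2.0, 3.0]))  # [0.5, 3.0]
```

With `weight` set to 1 the function returns the target direction unchanged, so the factor acts as a single knob for how strongly the shared semantics are damped.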

Paper Structure (25 sections, 14 equations, 19 figures, 5 tables)

Figures (19)

  • Figure 1: Conception. Current text-to-3D methods struggle when given prompts describing multiple objects bound to different attributes. Compared to (a) generating with existing methods, (b) generating with Progressive3D produces 3D content consistent with the given prompts.
  • Figure 2: Overview of a local editing step of our proposed Progressive3D. Given a source representation $\boldsymbol{\phi}_s$ supervised by source prompt $\boldsymbol{y}_s$, our framework aims to generate a target representation $\boldsymbol{\phi}_t$ conforming to the input target prompt $\boldsymbol{y}_t$ in the 3D space defined by the region prompt $\boldsymbol{y}_b$. Conditioned on the 2D mask $\boldsymbol{M}_t(\boldsymbol{r})$, we constrain the 3D content with $\mathcal{L}_{consist}$ and $\mathcal{L}_{initial}$. We further propose an Overlapped Semantic Component Suppression technique that steers the optimization to focus more on the semantic difference for precise progressive creation.
  • Figure 2: Quantitative ablation studies for proposed constraints and the OSCS technique based on DreamTime over CSP-100.
  • Figure 3: Qualitative ablations. The source prompt $\boldsymbol{y}_s$="A medieval soldier with metal armor holding a golden axe." and the target prompt $\boldsymbol{y}_t$="A medieval soldier with metal armor holding a golden axe and riding a terracotta wolf.", where green denotes the overlapped prompt and red denotes the different prompt.
  • Figure 4: Current text-to-3D methods often fail to produce precise results when the given prompt describes multiple interacting objects bound to different attributes, leading to significant issues including missing objects, mismatched attributes, and reduced quality.
  • ...and 14 more figures
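
The Figure 2 caption describes constraining 3D content with a consistency term conditioned on a 2D mask. As a rough illustration only, the sketch below penalizes differences between a source render and a target render at pixels outside the editable region. The function name `consistency_loss`, the nested-list image format, and the mean-squared-error form are assumptions of this sketch; the actual definition of $\mathcal{L}_{consist}$ may differ.

```python
from typing import List

Image = List[List[float]]


def consistency_loss(source: Image, target: Image, edit_mask: Image) -> float:
    """Mean squared difference between two renders outside the editable region.

    edit_mask[i][j] is 1.0 where the pixel may change and 0.0 where the
    source content should be preserved (a convention assumed for this sketch).
    """
    total = 0.0
    count = 0.0
    for src_row, tgt_row, mask_row in zip(source, target, edit_mask):
        for s, t, m in zip(src_row, tgt_row, mask_row):
            keep = 1.0 - m              # 1.0 for pixels that must stay fixed
            total += keep * (t - s) ** 2
            count += keep
    # If every pixel is editable there is nothing to preserve.
    return total / count if count > 0 else 0.0


source = [[0.0, 0.0], [0.0, 0.0]]
target = [[1.0, 0.0], [0.0, 2.0]]
# Both changed pixels lie inside the editable region, so the loss is zero.
print(consistency_loss(source, target, [[1.0, 0.0], [0.0, 1.0]]))  # 0.0
```

Changes inside the mask are free, while any change outside it raises the loss, which is the behavior a region-constrained edit needs.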