Vox-E: Text-guided Voxel Editing of 3D Objects

Etai Sella, Gal Fiebelman, Peter Hedman, Hadar Averbuch-Elor

TL;DR

Vox-E addresses the challenge of editing 3D objects according to text prompts by learning a grid-based volumetric representation from multi-view inputs and guiding edits with a diffusion-based SDS loss. A key contribution is a 3D volumetric regularization that couples the input and edited grids, combined with 3D cross-attention grids that localize edits, enabling both local and global changes to geometry and appearance. The approach outperforms prior 3D editing methods and 2D editing baselines in preserving object identity while achieving target edits, and it applies to real scenes and to various voxel frameworks. This work advances accessible, text-guided 3D content creation by providing a fast, explicit grid-based editing paradigm that integrates diffusion guidance with volumetric constraints.

Abstract

Large-scale text-guided diffusion models have garnered significant attention due to their ability to synthesize diverse images that convey complex visual concepts. This generative power has more recently been leveraged to perform text-to-3D synthesis. In this work, we present a technique that harnesses the power of latent diffusion models for editing existing 3D objects. Our method takes oriented 2D images of a 3D object as input and learns a grid-based volumetric representation of it. To guide the volumetric representation to conform to a target text prompt, we follow unconditional text-to-3D methods and optimize a Score Distillation Sampling (SDS) loss. However, we observe that combining this diffusion-guided loss with an image-based regularization loss that encourages the representation not to deviate too strongly from the input object is challenging, as it requires achieving two conflicting goals while viewing only structure-and-appearance coupled 2D projections. Thus, we introduce a novel volumetric regularization loss that operates directly in 3D space, utilizing the explicit nature of our 3D representation to enforce correlation between the global structure of the original and edited object. Furthermore, we present a technique that optimizes cross-attention volumetric grids to refine the spatial extent of the edits. Extensive experiments and comparisons demonstrate the effectiveness of our approach in creating a myriad of edits which cannot be achieved by prior work.
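
The volumetric regularization described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes that "enforcing correlation between the global structure" can be approximated by one minus the Pearson correlation between the density values of the input and edited grids. The function name, the NumPy representation of the grids, and the eps stabilizer are all hypothetical choices made for illustration.

```python
import numpy as np


def volumetric_correlation_loss(density_input, density_edited, eps=1e-8):
    """Return 1 - Pearson correlation between two density grids.

    Assumption (not from the source): each grid is an array of per-voxel
    density values, and both grids have the same shape.
    """
    a = np.asarray(density_input, dtype=np.float64)
    b = np.asarray(density_edited, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("density grids must have the same shape")

    # Flatten so the statistic is computed over all voxels at once.
    a = a.ravel()
    b = b.ravel()

    # Center both grids, then form covariance and the variance product.
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    covariance = np.mean(a_centered * b_centered)
    denominator = np.sqrt(np.mean(a_centered ** 2) * np.mean(b_centered ** 2)) + eps

    # Perfectly correlated grids give a loss near 0; anti-correlated near 2.
    return 1.0 - covariance / denominator
```

Because a correlation is invariant to scaling and shifting of the densities, a loss of this form would penalize changes to the overall structure without forcing the edited grid to match the input voxel by voxel, which is one plausible reason to regularize in 3D rather than on rendered images.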

Paper Structure (29 sections, 8 equations, 14 figures, 3 tables)

Figures (14)

  • Figure 1: Given multiview images of an object (left), our technique generates volumetric edits from target text prompts, allowing for significant geometric and appearance changes, while faithfully preserving the input object. The objects can be edited either locally (center) or globally (right), depending on the nature of the user-provided text prompt.
  • Figure 2: An overview of our approach. Given a set of posed images depicting an object, we optimize an initial feature grid (left). We then perform text-guided object editing using a generative SDS loss and a volumetric regularization, optimizing an edited grid $G_e$. To localize the edits, we optimize 3D cross-attention grids which define probability distributions over the object and the edit regions. We obtain a volumetric mask from these grids using an energy minimization problem over all the voxels. Finally, we merge the initial and edited grid to obtain a refined volumetric grid (right).
  • Figure 3: Optimizing 3D cross-attention grids for edit localization. We leverage rough 2D cross-attention maps (third column) for supervising the training of 3D cross-attention grids (fourth column). Provided with cross-attention grids associated with the edit (as demonstrated above for "christmas sweater" and "crown") and object regions, we formulate an energy minimization problem, which outputs a volumetric binary segmentation mask (fifth column). We then merge the features of the input (first column) and edited (second column) grids using this volumetric mask to obtain our final output (rightmost column). Note that warmer colors correspond to higher activations in the cross-attention maps and edited regions are colored in gray in the binary segmentation mask.
  • Figure 4: Cross-attention 2D maps and rendered 3D grids over multiple viewpoints, obtained for the token associated with the word "rollerskates" (from the "kangaroo on rollerskates" text prompt). While 2D cross-attention may yield inconsistent observations, such as high probabilities over the tail region in the rightmost column, our 3D grids can more accurately localize the region of interest (effectively smoothing out such inconsistencies).
  • Figure 5: Results obtained by our method over different objects and prompts (with the inputs displayed on the left). Please refer to the supplementary material for additional qualitative results.
  • ...and 9 more figures
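
The final merging step mentioned in the captions of Figures 2 and 3 (combining the input and edited grids with a volumetric binary mask) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the grids are assumed to be NumPy arrays of shape (X, Y, Z, C) holding per-voxel features, the mask is assumed to be a boolean array of shape (X, Y, Z) that is True inside the edit region, and the name merge_grids is hypothetical. The energy minimization that produces the mask is not shown.

```python
import numpy as np


def merge_grids(grid_input, grid_edited, edit_mask):
    """Take edited features where the mask is True, input features elsewhere.

    Assumptions (not from the source): grids have shape (X, Y, Z, C) and
    edit_mask has shape (X, Y, Z).
    """
    grid_input = np.asarray(grid_input)
    grid_edited = np.asarray(grid_edited)
    mask = np.asarray(edit_mask, dtype=bool)

    if grid_input.shape != grid_edited.shape:
        raise ValueError("input and edited grids must have the same shape")
    if mask.shape != grid_input.shape[:3]:
        raise ValueError("mask must match the spatial shape of the grids")

    # Broadcast the per-voxel mask over the feature channel dimension.
    return np.where(mask[..., None], grid_edited, grid_input)
```

With a merge of this kind, voxels outside the mask keep the input features exactly, so regions the prompt does not refer to are left unchanged.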