ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance

Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, Ziwei Liu

TL;DR

ComboVerse tackles compositional 3D asset creation from a single image. It first diagnoses a "multi-object gap" in existing single-image 3D models, then uses a two-stage pipeline: independent single-object reconstruction followed by a spatially guided object-combination stage. The core innovation is spatially-aware score distillation sampling (SSDS), which upweights the attention maps of spatial-relation tokens in the prompt to improve object placement while each object's geometry and texture are kept unchanged. On a 100-image benchmark, ComboVerse outperforms state-of-the-art baselines in semantic and GPT-based alignment, and it is further validated through user studies and scene reconstruction demonstrations. Two limitations remain: the method mostly handles scenes with fewer than five objects, and it relies on its backbones for geometry and texture, so stronger backbones and further geometry refinement are natural directions for improvement.

Abstract

Generating high-quality 3D assets from a given image is highly desirable in various applications such as AR/VR. Recent advances in single-image 3D generation explore feed-forward models that learn to infer the 3D model of an object without optimization. Though promising results have been achieved in single-object generation, these methods often struggle to model complex 3D assets that inherently contain multiple objects. In this work, we present ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple models. 1) We first perform an in-depth analysis of this "multi-object gap" from both model and data perspectives. 2) Next, with reconstructed 3D models of different objects, we seek to adjust their sizes, rotation angles, and locations to create a 3D asset that matches the given image. 3) To automate this process, we apply spatially-aware score distillation sampling (SSDS) from pretrained diffusion models to guide the positioning of objects. Our proposed framework emphasizes spatial alignment of objects, compared with standard score distillation sampling, and thus achieves more accurate results. Extensive experiments validate that ComboVerse achieves clear improvements over existing methods in generating compositional 3D assets.
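
To make step 2 of the abstract concrete, here is a minimal sketch of what adjusting an object's size, rotation angle, and location amounts to for one reconstructed model. It is an illustration, not the authors' implementation: the function name `apply_similarity`, the restriction to a single rotation angle about the z-axis, and the toy point list are all assumptions made for brevity.

```python
import math

def apply_similarity(points, scale, yaw, translation):
    """Scale each 3D point, rotate it about the z-axis by `yaw` radians,
    then translate it. The object's shape itself is left untouched."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    placed = []
    for x, y, z in points:
        # Uniform scaling (hypothetical parameter standing in for s_i).
        x, y, z = scale * x, scale * y, scale * z
        # Rotation about z (a simplification of r_i), then translation (t_i).
        placed.append((c * x - s * y + tx, s * x + c * y + ty, z + tz))
    return placed

# Toy "object": four vertices standing in for a reconstructed mesh.
box = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
placed = apply_similarity(box, scale=2.0, yaw=math.pi / 2, translation=(0.0, 0.0, 1.0))
```

In the combination stage only these pose parameters would be optimized per object, which is why the transform is kept separate from the vertex data.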

Paper Structure

This paper contains 14 sections, 10 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: ComboVerse can generate high-quality 3D models from a single image that contains multiple objects, e.g., a squirrel sitting on a paper box. We show textured meshes of created 3D content, showcasing stunning reconstruction quality.
  • Figure 2: "Multi-object gap" of models trained on Objaverse. (a) Camera Setting Bias. The reconstruction quality for small and non-centered objects degrades significantly compared to separate reconstruction. (b) Occlusion. The reconstruction results tend to blend when one object is occluded by another. (c) Leaking Pattern. The shape and texture of an object are influenced by other objects in the input image. For example, in (c), the tiger's back face adopts the owl's color, and its back surface becomes convex instead of concave due to the influence of the owl's shape.
  • Figure 3: Overview of our method. Given an input image that contains multiple objects, our method can generate high-quality 3D assets through a two-stage process. In the single-object reconstruction stage, we decompose every single object in the image with object inpainting, and perform single-image reconstruction to create individual 3D models. In the multi-object combination stage, we maintain the geometry and texture of each object while optimizing their scale, rotation, and translation parameters $\{ s_i, r_i, t_i\}$. This optimization process is guided by our proposed spatially-aware SDS loss $\mathcal{L}_{\mathrm{SSDS}}$, calculated on novel views, emphasizing the spatial token by enhancing its attention map weight. For example, considering the prompt "A fox lying on a toolbox." given to the 2D diffusion model, we emphasize the spatial token "lying" by multiplying its attention map with a constant $c$ ($c>1$). Also, we utilize the reference loss $\mathcal{L}_{\mathrm{Ref}}$, calculated on a reference view for additional constraints.
  • Figure 4: Object decomposition and inpainting. In this stage, given an input image, we segment each object to obtain a segmented object image with a noise background, $I_{i}$, and a bounding-aware mask $m_i$. Then $I_{i}$ and $m_i$ are input to Stable Diffusion to obtain the inpainted object $\hat{I}_i$.
  • Figure 5: 2D toy examples. We randomly initialize the squirrel with two different initial positions (left), and optimize the position parameters to match the prompt "a squirrel is sitting on a box". Compared to standard SDS, spatially-aware SDS produces better results.
  • ...and 7 more figures
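
The attention reweighting described in the Figure 3 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: the function name `reweight_attention`, the nested-list layout (rows are image locations, columns are prompt tokens), the made-up attention values, and the choice not to renormalize after scaling are all hypothetical.

```python
def reweight_attention(attn, token_index, c):
    """Return a copy of `attn` (rows = image locations, columns = prompt
    tokens) with the column for `token_index` multiplied by `c`."""
    if c <= 1:
        raise ValueError("c is expected to be greater than 1")
    return [
        [w * c if j == token_index else w for j, w in enumerate(row)]
        for row in attn
    ]

# Simplified tokenization of the caption's example prompt (assumption:
# one token per word; a real tokenizer may split differently).
tokens = ["a", "fox", "lying", "on", "a", "toolbox"]

# Made-up cross-attention weights for two image locations.
attn = [
    [0.1, 0.4, 0.1, 0.1, 0.1, 0.2],
    [0.1, 0.1, 0.2, 0.1, 0.1, 0.4],
]

# Emphasize the spatial token "lying" with a constant c > 1.
boosted = reweight_attention(attn, tokens.index("lying"), c=2.0)
```

Only the spatial token's column changes, so the denoiser's prediction is nudged toward the stated relation while the object tokens keep their original weights.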