MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

Baicheng Li, Dong Wu, Jun Li, Shunkai Zhou, Zecui Zeng, Lusong Li, Hongbin Zha

Abstract

Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts. We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies, attention-entropy weighting and visibility weighting, that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at https://github.com/devinli123/MV-SAM3D.
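
As an illustration of confidence-aware fusion, the following minimal Python sketch fuses per-view velocity predictions with weights derived from cross-attention entropy. It is not the authors' implementation: the softmax over negative entropy, the `temperature` parameter, the array shapes, and the function names `attention_entropy` and `fuse_velocities` are all assumptions made here for illustration.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Shannon entropy of each row of a row-stochastic attention matrix.

    attn: array of shape (..., K), each row summing to 1 over K image tokens.
    Returns an array of shape (...) with one entropy value per row.
    """
    attn = np.clip(attn, eps, 1.0)  # avoid log(0)
    return -(attn * np.log(attn)).sum(axis=-1)


def fuse_velocities(velocities, entropies, temperature=1.0):
    """Fuse per-view velocities with entropy-based weights (assumed form).

    velocities: (V, N, C) velocity predicted from each of V views
                for N latent points with C channels.
    entropies:  (V, N) attention entropy of each point under each view.
    Returns the fused velocity (N, C) and the weights (V, N).
    """
    # Lower entropy -> higher confidence; softmax across the view axis.
    logits = -entropies / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)
    fused = (weights[..., None] * velocities).sum(axis=0)
    return fused, weights
```

Under this assumed weighting, equal entropies across views reduce the fusion to simple averaging, while a view whose attention is more peaked (lower entropy) at a point receives a larger weight there.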

Paper Structure (16 sections, 9 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: MV-SAM3D enables multi-view, layout-aware 3D generation with physical plausibility. Left: A representative scene-level reconstruction, where each generated 3D object is overlaid onto the scene point cloud. Top right: Single-view generation produces hallucinated side appearance, while our adaptive multi-view fusion yields faithful reconstruction by leveraging complementary observations. Bottom right: Independent pose estimation leads to collisions, floating objects, and incorrect orientations; our physics-aware optimization produces physically plausible object layouts.
  • Figure 2: Overview of MV-SAM3D. Given multi-view images with segmentation masks and DA3-estimated pointmaps, our framework first performs per-object 3D generation by fusing flow matching velocities from each viewpoint with adaptive weighting (cross-attention entropy and geometric visibility). Multi-object composition is then achieved through layout injection during generation and post-generation pose refinement, resolving collisions, floating artifacts, and pose errors.
  • Figure 3: Attention-entropy visualization. For a plush toy observed from three viewpoints, we visualize the per-point cross-attention entropy. Regions directly visible from a given view exhibit low entropy (blue), while occluded regions show high entropy (red), confirming that attention entropy serves as a reliable implicit indicator of observation confidence.
  • Figure 4: Effect of entropy weighting. A plush toy observed from 6 views (5 frontal, 1 rear capturing the tail and a black label). Simple averaging: tail shape is wrong and the black label is missing. Entropy in Stage 1 only: correct structure emerges but label texture is white. Entropy in both stages: both structure and texture faithfully match the observation, confirming that entropy weighting is essential in both stages.
  • Figure 5: Effect of visibility weighting. A medicine box with distinct front/back textures. Entropy weighting only: front and back textures are mixed due to the symmetric structure confusing implicit matching. Entropy + visibility weighting: front and back appearances are correctly separated, with each face faithfully reflecting the observed texture.
  • ...and 3 more figures
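
The collision and floating artifacts mentioned in the captions of Figures 1 and 2 can be made concrete with a toy penalty. The Python sketch below is a simplified stand-in, not the paper's physics-aware optimization: it assumes each object is an axis-aligned box, that every object should rest on a single ground plane at `ground_z`, and that penetration depth and vertical gap are reasonable penalty terms. The function names `overlap_depth` and `layout_penalty` are hypothetical.

```python
import numpy as np


def overlap_depth(box_a, box_b):
    """Penetration depth of two axis-aligned boxes, each (min_xyz, max_xyz).

    Returns 0.0 when the boxes do not overlap on every axis; otherwise the
    smallest per-axis overlap, i.e. the shortest shift that separates them.
    """
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    extent = hi - lo
    if np.any(extent <= 0):
        return 0.0
    return float(extent.min())


def layout_penalty(boxes, ground_z=0.0):
    """Return (collision, contact) penalties for a list of boxes.

    collision: summed penetration depth over all box pairs.
    contact:   summed absolute gap between each box bottom and the ground
               plane (positive for floating or sinking objects).
    """
    collision = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            collision += overlap_depth(boxes[i], boxes[j])
    contact = sum(abs(float(b[0][2]) - ground_z) for b in boxes)
    return collision, contact
```

A pose refinement step could, in principle, minimize a weighted sum of the two terms over object translations; handling stacked objects or mesh-level contact would need a richer support model than the single ground plane assumed here.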