
SceneExpander: Expanding 3D Scenes with Free-Form Inserted Views

Zijian He, Renjie Liu, Yihao Wang, Weizhi Zhong, Huan Yuan, Kun Gai, Guangrun Wang, Guanbin Li

Abstract

World building with 3D scene representations is increasingly important for content creation, simulation, and interactive experiences, yet real workflows are inherently iterative: creators must repeatedly extend an existing scene under user control. Motivated by this gap, we study 3D scene expansion in a user-centric workflow: starting from a real scene captured by multi-view images, we extend its coverage by inserting an additional view synthesized by a generative model. Unlike simple object editing or style transfer within a fixed scene, the inserted view is often 3D-misaligned with the original reconstruction, introducing geometry shifts, hallucinated content, or view-dependent artifacts that break global multi-view consistency. To address these challenges, we propose SceneExpander, which applies test-time adaptation to a parametric feed-forward 3D reconstruction model with two complementary distillation signals: anchor distillation stabilizes the original scene by distilling geometric cues from the captured views, while inserted-view self-distillation preserves observation-supported predictions yet adapts latent geometry and appearance to accommodate the misaligned inserted view. Experiments on ETH scenes and online data demonstrate improved expansion behavior and reconstruction quality under misalignment.
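The interplay of the two adaptation signals can be illustrated with a deliberately simplified toy model: a linear "reconstructor" adapted by gradient descent, where anchor distillation pins captured-view predictions to a frozen teacher and self-distillation follows the teacher on observation-supported channels while fitting the inserted view elsewhere. All names, the channel mask, and the weight `lam` are illustrative assumptions; the paper adapts a feed-forward 3D reconstructor, not a linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "reconstructor": theta maps view features to per-view geometry.
# The frozen initial weights theta_0 play the role of the teacher.
theta_0 = rng.normal(size=(3, 3))
theta = theta_0.copy()

X_cap = rng.normal(size=(32, 3))          # captured-view features (anchors)
x_ins = np.array([[1.0, -1.0, 2.0]])      # inserted-view features
y_ins = x_ins @ theta_0 + 0.5             # misaligned inserted-view target
mask = np.array([[1.0, 1.0, 0.0]])        # 1 = observation-supported channel

lam, lr = 1.0, 0.05                       # illustrative weight and step size
for _ in range(500):
    # Anchor distillation: keep captured-view predictions near the teacher's.
    r_anchor = X_cap @ theta - X_cap @ theta_0
    g_anchor = X_cap.T @ r_anchor / len(X_cap)
    # Inserted-view self-distillation: follow the teacher on observation-
    # supported channels, fit the misaligned inserted view on the rest.
    r_self = mask * (x_ins @ theta - x_ins @ theta_0) \
           + (1 - mask) * (x_ins @ theta - y_ins)
    g_self = x_ins.T @ r_self
    theta -= lr * (g_anchor + lam * g_self)
```

After adaptation, the captured-view predictions stay close to the teacher while the inserted view's unsupported channel moves toward its target, mirroring the trade-off the two losses are designed to negotiate.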

Paper Structure

This paper contains 26 sections, 11 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Two examples of controllable 3D scene expansion via text-guided view insertion. Starting from a 3D reconstruction of captured views, the user requests an additional view beyond the observed region, which is synthesized and inserted to extend the scene. Yellow marks the view-consistent (captured) area, and red highlights newly introduced content/objects.
  • Figure 2: Challenges in 3D scene expansion via free-form view insertion. (1) A 3D-misaligned inserted view can corrupt reconstruction quality in the captured region. (2) Newly introduced content must remain consistent under novel viewpoints. Red boxes highlight ghosting artifacts caused by conflicting 3D constraints from the inserted view.
  • Figure 3: Overview of SceneExpander. (1) A feed-forward reconstructor $f_\theta$ predicts an initial 3D scene from captured views. (2) To accommodate a prompt-generated inserted view $I_g$ that may be 3D-misaligned, we adapt $f_\theta$ at test time using two distillation losses: anchor distillation on captured views and inserted-view self-distillation on $I_g$. (3) The adapted model produces an expanded 3D scene consistent with the captured region while incorporating the inserted content.
  • Figure 4: Geometry-perturbation augmentation. We show four augmentation modes: identity (original), global affine warp, blockwise piecewise-affine warp (grid $3{\times}3$), and their combination (global+block), designed to mimic pose drift and local geometric inconsistencies.
  • Figure 5: Qualitative comparison of scene expansion under misaligned insertion. We show an indoor scene from the Online collection (left) and an outdoor scene from the ETH dataset (right). The first row presents the generated inserted view (purple box) together with a subset of captured multi-view images from the original scene (blue box). Each subsequent row shows renderings at novel camera poses for different methods, illustrating both fidelity to the captured region and insertion satisfaction.
  • ...and 5 more figures
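The four geometry-perturbation modes of Figure 4 (identity, global affine, blockwise piecewise-affine on a $3{\times}3$ grid, and their combination) can be sketched as coordinate-map warps. This is an illustrative numpy version; the warp magnitudes, nearest-neighbour resampling, and all function names are assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def global_affine(coords, scale=0.02, shift=2.0):
    """Small random affine warp of (y, x) coordinates (pose-drift proxy)."""
    A = np.eye(2) + rng.uniform(-scale, scale, size=(2, 2))
    t = rng.uniform(-shift, shift, size=2)
    return coords @ A.T + t

def blockwise_affine(coords, h, w, grid=3, scale=0.02, shift=1.0):
    """Independent small affine per cell of a grid x grid partition
    (local-inconsistency proxy)."""
    out = coords.copy()
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    c2 = coords.reshape(h, w, 2)
    o2 = out.reshape(h, w, 2)          # view: writes land in `out`
    for i in range(grid):
        for j in range(grid):
            A = np.eye(2) + rng.uniform(-scale, scale, size=(2, 2))
            t = rng.uniform(-shift, shift, size=2)
            blk = c2[ys[i]:ys[i+1], xs[j]:xs[j+1]]
            o2[ys[i]:ys[i+1], xs[j]:xs[j+1]] = blk @ A.T + t
    return out

def warp_image(img, coords):
    """Nearest-neighbour resample of img at the warped coordinates."""
    h, w = img.shape[:2]
    yx = np.rint(coords).astype(int)
    yx[:, 0] = np.clip(yx[:, 0], 0, h - 1)
    yx[:, 1] = np.clip(yx[:, 1], 0, w - 1)
    return img[yx[:, 0], yx[:, 1]].reshape(img.shape)

h = w = 60
img = rng.random((h, w))
coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"),
                  axis=-1).reshape(-1, 2).astype(float)

modes = {
    "identity": coords,
    "global": global_affine(coords),
    "block": blockwise_affine(coords, h, w),
}
modes["global+block"] = blockwise_affine(global_affine(coords), h, w)
aug = {name: warp_image(img, c) for name, c in modes.items()}
```

Composing the global and blockwise warps (the "global+block" mode) lets a single training view simultaneously carry a pose-drift-style error and local geometric inconsistencies, which is how Figure 4 motivates the combined mode.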