FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis

Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, Chongxuan Li

TL;DR

FlexWorld generates flexible-view 3D content from a single image. It addresses the lack of 3D data by combining a video-to-video (V2V) diffusion model with a progressive, geometry-aware 3D scene expansion. The V2V diffusion model is fine-tuned on depth-estimated training pairs to produce high-quality novel views under large camera variations, while the scene expansion progressively integrates new content using dense depth and a 3D Gaussian splatting (3DGS) representation. A refinement stage and camera-trajectory planning support geometric coherence and visual fidelity during 360° rotations and zooming. On the RealEstate10K and Tanks-and-Temples datasets, FlexWorld outperforms state-of-the-art baselines on both novel-view synthesis and 3D scene generation, showing strong 3D consistency and flexible viewpoint control from monocular input.

Abstract

Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging due to a lack of 3D data. To this end, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model to generate high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process to construct a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building upon it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views like 360° rotations and zooming. Project page: https://ml-gsai.github.io/FlexWorld.
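
To make the idea of adding new 3D content to a global scene concrete, the following is a minimal sketch of one expansion step: lifting a depth map into world-space points and appending only the pixels flagged as previously unseen. It is an illustration under stated assumptions, not the authors' geometry-aware fusion. The function names `unproject_depth` and `expand_scene`, the pinhole intrinsics `K`, the `cam_to_world` pose convention, and the boolean `new_mask` are all hypothetical, and the scene is a plain point array rather than 3DGS.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) depth map to (H*W, 3) world-space points (pinhole model)."""
    h, w = depth.shape
    # Pixel grid: u runs along columns (x), v along rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Camera-space points: X_cam = depth * K^{-1} [u, v, 1]^T.
    rays = pixels @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    # Rigid transform into world space (assumed camera-to-world convention).
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t

def expand_scene(scene_points, depth, K, cam_to_world, new_mask):
    """Append only points whose pixels are flagged as unseen (hypothetical mask)."""
    pts = unproject_depth(depth, K, cam_to_world)
    return np.concatenate([scene_points, pts[new_mask.reshape(-1)]], axis=0)

if __name__ == "__main__":
    depth = np.full((2, 2), 2.0)
    mask = np.array([[True, False], [False, True]])
    scene = expand_scene(np.empty((0, 3)), depth, np.eye(3), np.eye(4), mask)
    print(scene.shape)
```

In this simplified form, the mask decides which regions count as new content. A full pipeline would also need to align the estimated depth with the existing scene before merging, which this sketch does not attempt.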

Paper Structure

This paper contains 19 sections, 4 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: FlexWorld generates high-quality videos with camera control and progressively builds flexible-view 3D scenes. (a) FlexWorld introduces a V2V diffusion model that produces high-quality videos from incomplete scene renderings, given diverse camera trajectories with large variation. (b) FlexWorld progressively generates flexible-view 3DGS scenes (e.g., supporting 360° rotations and zooming) via the V2V diffusion model.
  • Figure 2: Overview of FlexWorld. FlexWorld trains a strong V2V diffusion capable of generating high-quality videos from incomplete views rendered from coarse 3D scenes. It progressively expands the 3D scene by adding new 3D content estimated from the refined videos via a dense stereo model. Ultimately, from a single image, it yields a detailed 3D scene capable of rendering flexible viewpoints.
  • Figure 3: We improve our video diffusion model to enable generating 3D consistent videos under large camera variation. We present novel views generated from each model when the camera is rotated 180 degrees to the left. The red bounding box indicates 3D inconsistency or poor visual quality in the generated content. Our model generates higher quality and more consistent static 3D scenes.
  • Figure 4: Our dataset construction method yields more accurate training pairs. We present frames of incomplete videos rendered from initial point clouds generated by the dense stereo model MASt3R [leroy2024grounding] (i.e., the dataset construction method of ViewCrafter [viewcrafter]) and from our 3DGS reconstruction. Our approach produces incomplete videos with better alignment to ground truth, resulting in higher-quality training pairs.
  • Figure 5: Qualitative comparison on novel view synthesis. We assessed the generative capabilities of various models using the same camera trajectory, focusing on the midpoint. The green bounding box in the ground truth highlights regions requiring consistency with the input, while the remaining areas demand coherent content generation. The red bounding box marks low-quality outputs in baseline models. Our model demonstrates superior visual generation quality, even under effectively controlled camera conditions.
  • ...and 5 more figures