FlexWorld: Progressively Expanding 3D Scenes for Flexible-View Synthesis
Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, Chongxuan Li
TL;DR
FlexWorld generates flexible-view 3D scenes from a single image, working around the scarcity of 3D data by combining a video-to-video (V2V) diffusion model with progressive, geometry-aware 3D scene expansion. The V2V diffusion model is fine-tuned on depth-estimated training pairs to produce high-quality novel views under large camera variations, while the scene expansion progressively integrates new content using dense depth and a 3D Gaussian splatting representation. A refinement stage and camera-trajectory planning help maintain geometric coherence and visual fidelity during 360° rotations and zooming. On the RealEstate10K and Tanks and Temples datasets, FlexWorld outperforms state-of-the-art baselines on both novel-view synthesis and 3D scene generation, showing strong 3D consistency and flexible viewpoint control from monocular input.
Abstract
Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging due to a lack of 3D data. To this end, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model to generate high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process to construct a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building upon it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality across multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views such as 360° rotations and zooming. Project page: https://ml-gsai.github.io/FlexWorld.
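
To make the idea of progressive expansion concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a plain list of 3D points in place of the 3D Gaussian splatting scene, a translation-only pinhole camera, and a hypothetical `generate_depth` callable standing in for the V2V generation and depth estimation steps. The names `unproject`, `fuse`, and `expand_scene` are illustrative and do not come from the paper.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Point = Tuple[float, float, float]
DepthMap = Sequence[Sequence[float]]


def unproject(depth: DepthMap, fx: float, fy: float, cx: float, cy: float,
              cam_origin: Point = (0.0, 0.0, 0.0)) -> List[Point]:
    """Lift a dense depth map to 3D points with a pinhole camera model.

    Simplifying assumption: the camera is axis-aligned, so its pose is
    a translation only (no rotation).
    """
    ox, oy, oz = cam_origin
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # non-positive depth marks an invalid pixel
                continue
            points.append((ox + (u - cx) * z / fx,
                           oy + (v - cy) * z / fy,
                           oz + z))
    return points


def fuse(scene: List[Point], new_points: Iterable[Point],
         min_dist: float = 1e-3) -> int:
    """Append only points not already covered by the scene.

    A crude stand-in for geometry-aware fusion: a new point is kept only
    if no existing point lies within `min_dist`. Returns the number added.
    """
    added = 0
    for p in new_points:
        covered = any(
            sum((a - b) ** 2 for a, b in zip(p, q)) < min_dist ** 2
            for q in scene
        )
        if not covered:
            scene.append(p)
            added += 1
    return added


def expand_scene(scene: List[Point], cam_origins: Iterable[Point],
                 generate_depth: Callable[[List[Point], Point], DepthMap],
                 intrinsics: Tuple[float, float, float, float]) -> List[Point]:
    """Progressive loop: visit each planned camera, obtain depth, fuse it."""
    for origin in cam_origins:
        # Hypothetical stand-in for novel-view generation + depth estimation.
        depth = generate_depth(scene, origin)
        fuse(scene, unproject(depth, *intrinsics, cam_origin=origin))
    return scene
```

In this sketch, a second camera that overlaps the first contributes only the points that the scene does not already contain, which is the basic behavior a progressive expansion needs. A real system would also have to handle camera rotation, depth alignment between views, and appearance, none of which are modeled here.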
