SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis

Hyojun Go, Byeongjun Park, Jiho Jang, Jin-Young Kim, Soonwoo Kwon, Changick Kim

TL;DR

SplatFlow presents a unified framework for text-driven 3D Gaussian Splatting synthesis that jointly models multi-view images, depths, and camera poses. It combines a multi-view Rectified Flow model with a GSDecoder to decode latent outputs into pixel-aligned 3DGS, enabling direct generation and editing without per-scene optimization. Training-free inversion and inpainting capabilities support versatile 3D editing tasks, including object replacement, novel view synthesis, and camera pose estimation, demonstrated on MVImgNet and DL3DV-7K. The approach achieves competitive or superior generation quality and editing effectiveness compared to baselines, highlighting its potential as a versatile foundation model for 3D content creation and manipulation.

Abstract

Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks (including object editing, novel view synthesis, and camera pose estimation) within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
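To make the rectified flow idea in the abstract concrete, the following is a minimal toy sketch, not the paper's model. It assumes the common convention in which t = 0 is noise and t = 1 is data, with samples moving along straight paths x_t = (1 - t) x_0 + t x_1. The function `toy_velocity` is a hypothetical closed-form stand-in for the learned network: it is the exact velocity field only when all data mass sits at a single target point. SplatFlow itself learns a text-conditioned velocity over multi-view latents, which this sketch does not attempt to reproduce.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    When the data distribution is a single point `target`, the velocity
    that transports any x at time t to that point at t=1 is
    (target - x) / (1 - t).
    """
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


def euler_sample(x0, target, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with forward Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
target = [1.0, -2.0, 0.5, 3.0]
sample = euler_sample(noise, target)
```

Because the paths in this toy setting are exactly straight, the Euler integrator reaches the target regardless of the number of steps. A learned velocity field is only approximately straight, which is why practical samplers still use multiple steps.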

Paper Structure

This paper contains 72 sections, 8 equations, 14 figures, 8 tables, 2 algorithms.

Figures (14)

  • Figure 1: SplatFlow for 3D Gaussian Splatting synthesis and its training-free applications. (a) Examples of direct 3D Gaussian Splatting (3DGS) generation only from text prompts, (b) Training-free applications, including 3DGS object editing, camera pose estimation, and novel view synthesis. SplatFlow seamlessly integrates these capabilities, showcasing its versatility in generating and editing complex 3D content.
  • Figure 2: Overview of SplatFlow. SplatFlow consists of two main components: a multi-view Rectified Flow (RF) model (Section 4.3) and a Gaussian Splatting Decoder (GSDecoder, Section 4.2). Conditioned on text, the RF model generates multi-view latents, including image, depth, and Plücker ray coordinates. After an optimization process to estimate camera poses, the GSDecoder decodes these latents into pixel-aligned 3DGS.
  • Figure 3: Qualitative results in text-to-3DGS generation on MVImgNet and DL3DV validation sets. The first two rows are rendered scenes from the MVImgNet dataset, while the last two rows are from the DL3DV dataset. Our SplatFlow produces cohesive and realistic scenes with sharp details, accurately capturing the intricacies of real-world environments and accommodating diverse camera trajectories.
  • Figure 4: Qualitative results in 3D editing with MVInpainter [cao2024mvinpainter] and DGE [chen2024dge]. We show rendered scenes except for MVInpainter.
  • Figure 5: Qualitative results for camera pose estimation. Camera poses are estimated from multi-view images. Image border colors match each camera, with black cameras indicating GT poses.
  • ...and 9 more figures
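
The Figure 2 caption mentions Plücker ray coordinates as one of the generated latents. As a hedged illustration of what that representation encodes, the sketch below computes a 6D Plücker vector for a single pixel ray under a pinhole camera. The function names, the (direction, moment) ordering, and the +z-forward camera convention are assumptions made for this example; the paper's exact parameterization may differ.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def pixel_ray(u, v, fx, fy, cx, cy, rotation, center):
    """World-space ray through pixel (u, v) of a pinhole camera.

    `rotation` is an assumed 3x3 camera-to-world rotation (list of rows)
    and `center` is the camera position in world coordinates.
    """
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]  # +z forward (assumption)
    d_world = [sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(x * x for x in d_world))
    return list(center), [x / norm for x in d_world]


def plucker(origin, direction):
    """6D Plücker coordinates (direction, moment) with moment = origin x direction."""
    return list(direction) + cross(origin, direction)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin, direction = pixel_ray(
    32.0, 32.0, fx=50.0, fy=50.0, cx=32.0, cy=32.0,
    rotation=identity, center=[1.0, 0.0, 0.0],
)
ray = plucker(origin, direction)
```

A useful property of this encoding is that the moment does not change when the origin slides along the ray, so each pixel's 6D vector identifies the ray itself rather than a particular point on it. This is what makes it possible in principle to recover a camera pose from a dense map of such rays.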