VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

Hyojun Go, Byeongjun Park, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, Changick Kim

TL;DR

VideoRFSplat tackles direct text-to-3D Gaussian Splatting for unbounded real-world scenes by coupling a dual-stream pose-generation module with a pre-trained video diffusion backbone via communication blocks. It introduces asynchronous sampling, decoupling pose and image timesteps to reduce mutual ambiguity, and augments this with a camera-conditioned CFG strategy, enabling stable joint generation without external refinements. The model employs a Plücker-ray representation and a Gaussian Splat Decoder to render high-fidelity 3DGS from generated latents, trained on RealEstate10K, MVImgNet, DL3DV-10K, and ACID. Empirically, VideoRFSplat achieves state-of-the-art results among direct text-to-3DGS methods without SDS refinements across T3Bench, MVImgNet, and DL3DV, demonstrating improved realism, pose consistency, and text alignment suitable for real-world scene synthesis.
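The Plücker-ray representation mentioned above can be illustrated with a short NumPy sketch. This is not the authors' code: the function name `plucker_rays`, the pinhole intrinsics `K`, the 4x4 camera-to-world matrix, sampling at pixel centres, and the (direction, moment) channel ordering are all assumptions made for illustration. The only standard fact used is that a ray with origin o and unit direction d has Plücker coordinates (d, o × d).

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Per-pixel Plücker ray map of shape (height, width, 6).

    Hypothetical helper: assumes a pinhole camera with intrinsics K (3x3)
    and a 4x4 camera-to-world pose. Channels are (direction, moment).
    """
    # Pixel-centre coordinates in homogeneous form, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project pixels to camera-space directions, then rotate to world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    rotation, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    d = dirs_cam @ rotation.T
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)

    # Moment vector o x d; together with d it identifies the ray.
    m = np.cross(np.broadcast_to(origin, d.shape), d)
    return np.concatenate([d, m], axis=-1)
```

A map like this has the same spatial layout as an image, which is one reason ray-based pose encodings are convenient to pair with image or video latents.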

Abstract

We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.
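The asynchronous sampling idea, in which poses are denoised faster than images, can be sketched as a pair of timestep schedules. The abstract does not give the exact schedule, so the form below is an assumption: the image timestep runs linearly from 1 (pure noise) to 0 (clean), and the pose timestep is a shifted and rescaled copy that reaches 0 once the image timestep has dropped to a hypothetical offset `delta`. The function name `async_schedule` is also hypothetical.

```python
import numpy as np

def async_schedule(num_steps, delta=0.2):
    """Return (t_image, t_pose), each of length num_steps + 1.

    Assumed form, for illustration only: both timesteps start at 1, and the
    pose timestep hits 0 when the image timestep equals delta.
    """
    t_image = np.linspace(1.0, 0.0, num_steps + 1)
    # Shift by delta and rescale so the pose stream still starts at t = 1,
    # then clip so it stays at 0 for the remaining image steps.
    t_pose = np.clip((t_image - delta) / (1.0 - delta), 0.0, 1.0)
    return t_image, t_pose

t_image, t_pose = async_schedule(num_steps=10, delta=0.2)
for ti, tp in zip(t_image, t_pose):
    print(f"t_image={ti:.2f}  t_pose={tp:.2f}")
```

Under this assumed schedule the pose timestep is never larger than the image timestep, so at every step the image stream is conditioned on poses that are at least as clean as the images themselves.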

Paper Structure

This paper contains 44 sections, 6 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: 3D Gaussian Splatting scenes and rendered views generated by VideoRFSplat from diverse text prompts. VideoRFSplat directly generates realistic 3D scenes from text without SDS [poole2023dreamfusion, li2025director3d] refinement, outperforming prior methods [go2024splatflow, li2025director3d] that rely on SDS refinements.
  • Figure 2: VideoRFSplat Overview. (a) VideoRFSplat consists of a dual-stream pose-video model and a Gaussian Splat decoder. To minimize pose-image interference, the pose model is side-attached to the pre-trained video model, interacting through communication blocks. With separate timesteps for pose and video models, this enables asynchronous sampling, reducing ambiguity and improving sampling stability. (b) Communication block, where cross-attention facilitates bidirectional information exchange between the pose and image modalities.
  • Figure 3: Failure analysis of synchronous sampling and the effectiveness of asynchronous sampling. (Left) Early in sampling ($t > 0.85$), synchronous sampling induces excessive oscillations in camera poses, causing divergence and misalignment with images. The misaligned poses then lead to inconsistent multi-view generation, particularly in the background. (Right) Asynchronous sampling stabilizes joint generation, leading to coherent multi-view generation.
  • Figure 4: Asynchronous schedule ($\delta=0.2$).
  • Figure 5: Qualitative comparison of text-to-3DGS generation on the DL3DV [ling2024dl3dv] and MVImgNet [yu2023mvimgnet] validation sets as well as T3Bench [he2023t]. Rendered scenes: the first two rows are from DL3DV, the third row from T3Bench, and the last row from MVImgNet. Despite not using SDS++ [li2025director3d], VideoRFSplat generates detailed, visually consistent scenes, producing appropriate scene-specific camera poses.
  • ...and 8 more figures