3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Hyun-kyu Ko, Jihyeon Park, Younghyun Kim, Dongheok Park, Eunbyung Park

Abstract

Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/
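To make the 1-frame optimization idea concrete, the following is a minimal sketch of how a single training example might be assembled from a subject's multi-view images: one view is selected as the target and a small subset serves as reference conditions, alongside a prompt containing a unique identifier. The function name `sample_one_frame_example`, the prompt template, the `class_noun` parameter, and the choice to exclude the target from the references are all assumptions made for illustration and are not taken from the paper.

```python
import random


def sample_one_frame_example(views, num_refs, identifier="V",
                             class_noun="object", rng=None):
    """Pick one target view and a reference subset from multi-view images.

    Hypothetical helper: names, prompt template, and sampling policy are
    illustrative assumptions, not the authors' implementation.
    """
    rng = rng or random.Random()
    if not 1 <= num_refs < len(views):
        raise ValueError("need 1 <= num_refs < number of views")

    # The target is a single image, i.e. a 1-frame "video", so the update
    # carries no temporal signal that the model could overfit to.
    target_idx = rng.randrange(len(views))

    # Assumption: the target view is excluded from the reference set.
    candidates = [i for i in range(len(views)) if i != target_idx]
    ref_idx = sorted(rng.sample(candidates, num_refs))

    return {
        "target": views[target_idx],
        "references": [views[i] for i in ref_idx],
        # Assumed prompt template containing the unique identifier.
        "prompt": f"a video of {identifier} {class_noun}",
    }


if __name__ == "__main__":
    views = [f"view_{i}.png" for i in range(6)]
    print(sample_one_frame_example(views, num_refs=3, rng=random.Random(0)))
```

Drawing a fresh target and reference subset at every step would expose the model to many target/reference pairings from a small image set, which is one plausible reading of how a minimal reference set can supply view-specific hints.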

Paper Structure (38 sections, 5 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 1: 3D-aware video customization using our proposed framework. Given a few multi-view reference images (left) and a text prompt, our approach generates high-fidelity, view-consistent videos that seamlessly integrate customized 3D subjects into dynamic environments.
  • Figure 2: Overview of the 3DreamBooth training pipeline. (Left) From multi-view images, one is selected as the target, while a sampled subset serves as reference conditions alongside a global prompt with a unique identifier $V$. (Right) The text and noisy target latents pass through the main branch (3DB LoRA), while reference latents pass through a shared 3Dapter. Their features are concatenated for Multi-view Joint Attention. This 1-frame optimization decouples spatial geometry from temporal dynamics to efficiently learn a 3D prior.
  • Figure 3: Convergence Analysis and Detail Preservation. (A) Reconstruction Loss: Integrating 3Dapter (blue) drastically accelerates convergence compared to the 3DreamBooth baseline (gray). (B) Qualitative Comparison: 3DreamBooth alone (purple dot) struggles with high-frequency details due to the information bottleneck. In contrast, 3Dapter+3DreamBooth (yellow dot) perfectly preserves intricate textures (e.g., "RIO" typography) much earlier, demonstrating the efficacy of explicit visual priors.
  • Figure 4: Detailed architecture of our two-stage conditioning mechanism. (A) Single-view Pre-training: The visual adapter (3Dapter) is pre-trained using single-view references and fused via Single-view Joint Attention. (B) Multi-view Joint Optimization: A trainable 3DreamBooth LoRA is added to the main branch. A minimal set of multi-view reference images is processed in parallel by the shared 3Dapter. The Multi-view Joint Attention acts as a dynamic selective router, querying relevant view-specific geometric hints to reconstruct the target view.
  • Figure 5: Visualization of the Dynamic Selective Router mechanism. (Left) Generated frames and four multi-view conditions provided to 3Dapter. The generated poses align with View 2 (red box). (Right) Cross-attention heatmaps across diffusion timesteps ($t=0,20,40$). The network selectively assigns higher attention weights to the relevant view (View 2) to extract specific geometric features, rather than uniformly aggregating all conditions.
  • ...and 11 more figures
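
The captions for Figures 2, 4, and 5 describe target features being concatenated with multi-view reference features for joint attention, with higher attention weights going to the most relevant view. The sketch below illustrates that routing behavior in a simplified single-head setting: target queries attend over the concatenation of target keys and per-view reference keys, and the attention mass assigned to each view is summed. The function names, the plain-list tensor representation, and the absence of learned projections or multiple heads are simplifying assumptions, not the paper's architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def per_view_attention_mass(target_queries, target_keys, view_keys):
    """Average share of attention each reference view receives.

    target_queries: list of query vectors from the target branch
    target_keys:    list of key vectors from the target branch
    view_keys:      list (one entry per view) of lists of key vectors
    Hypothetical helper for illustration only.
    """
    scale = 1.0 / math.sqrt(len(target_queries[0]))

    # Joint attention: concatenate target and reference keys along the
    # token axis, remembering which view (or -1 for target) owns each key.
    keys = list(target_keys)
    owner = [-1] * len(target_keys)
    for view_index, view in enumerate(view_keys):
        keys.extend(view)
        owner.extend([view_index] * len(view))

    mass = [0.0] * len(view_keys)
    for query in target_queries:
        weights = softmax([scale * dot(query, key) for key in keys])
        for weight, who in zip(weights, owner):
            if who >= 0:
                mass[who] += weight
    return [m / len(target_queries) for m in mass]


if __name__ == "__main__":
    queries = [[4.0, 0.0]]
    own_keys = [[0.0, 0.0]]
    refs = [[[0.0, 4.0]], [[4.0, 0.0]], [[-4.0, 0.0]]]
    # The second view's key aligns with the query, so it gets the most mass.
    print(per_view_attention_mass(queries, own_keys, refs))
```

Because the softmax runs over all tokens jointly, the views compete for attention rather than being averaged uniformly, which matches the selective behavior the Figure 5 caption describes.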