SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer

Zijie Wu, Chaohui Yu, Yanqin Jiang, Chenjie Cao, Fan Wang, Xiang Bai

TL;DR

SC4D tackles single-view video-to-4D generation by decoupling motion and appearance into sparse control points and dense Gaussian fields. It introduces Adaptive Gaussian initialization and Gaussian Alignment loss to address shape degeneration during refinement, enabling a robust coarse-to-fine optimization. The framework achieves superior reference-view alignment, spatio-temporal consistency, and motion fidelity while improving efficiency, and it enables a novel motion-transfer pipeline guided by textual descriptions. A depth-conditioned ControlNet–assisted transfer demonstrates flexible animation of new entities using learned motion. Overall, SC4D advances practical video-to-4D generation and opens pathways for text-guided motion transfer across diverse 4D objects.

Abstract

Recent advances in 2D/3D generative models enable the generation of dynamic 3D objects from a single-view video. Existing approaches utilize score distillation sampling to form the dynamic scene as a dynamic NeRF or dense 3D Gaussians. However, these methods struggle to strike a balance among reference view alignment, spatio-temporal consistency, and motion fidelity under single-view conditions, due to the implicit nature of NeRF or the intricate dense Gaussian motion prediction. To address these issues, this paper proposes an efficient, sparse-controlled video-to-4D framework named SC4D that decouples motion and appearance to achieve superior video-to-4D generation. Moreover, we introduce Adaptive Gaussian (AG) initialization and Gaussian Alignment (GA) loss to mitigate the shape degeneration issue, ensuring the fidelity of the learned motion and shape. Comprehensive experimental results demonstrate that our method surpasses existing methods in both quality and efficiency. In addition, facilitated by SC4D's disentangled modeling of motion and appearance, we devise a novel application that seamlessly transfers the learned motion onto a diverse array of 4D entities according to textual descriptions.
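To make the idea of sparse control concrete, the following is a minimal sketch of how a small set of control points could drive many dense Gaussians. It is not the authors' implementation: the function name `blend_control_motion`, the k-nearest-neighbor lookup, the Gaussian RBF weighting, and the parameters `k` and `sigma` are all assumptions chosen for illustration, and only translations are blended (rotations and the deformation MLP that would predict `ctrl_shift` are omitted).

```python
import numpy as np

def blend_control_motion(dense_xyz, ctrl_xyz, ctrl_shift, k=4, sigma=0.1):
    """Move dense points by blending translations of their k nearest control points.

    dense_xyz:  (N, 3) canonical positions of dense Gaussians
    ctrl_xyz:   (M, 3) canonical positions of sparse control points (M >= k)
    ctrl_shift: (M, 3) per-control-point translation at one timestep
                (hypothetical stand-in for a deformation network's output)
    """
    # Pairwise squared distances between dense points and control points: (N, M)
    d2 = ((dense_xyz[:, None, :] - ctrl_xyz[None, :, :]) ** 2).sum(-1)
    # Indices and squared distances of the k nearest control points: (N, k)
    nn = np.argsort(d2, axis=1)[:, :k]
    nn_d2 = np.take_along_axis(d2, nn, axis=1)
    # Gaussian RBF weights; subtracting the row minimum avoids underflow
    # without changing the normalized result
    nn_d2 = nn_d2 - nn_d2.min(axis=1, keepdims=True)
    w = np.exp(-nn_d2 / (2.0 * sigma ** 2))
    w = w / w.sum(axis=1, keepdims=True)
    # Weighted sum of neighbor translations: (N, 3)
    shift = (w[..., None] * ctrl_shift[nn]).sum(axis=1)
    return dense_xyz + shift

# Usage: 8 control points drive 50 dense points
rng = np.random.default_rng(0)
dense = rng.normal(size=(50, 3))
ctrl = rng.normal(size=(8, 3))
moved = blend_control_motion(dense, ctrl, 0.05 * rng.normal(size=(8, 3)))
```

Under these assumptions, motion is parameterized by M control points rather than N dense Gaussians, which is one way to read the abstract's claim that decoupling motion from appearance avoids intricate dense motion prediction.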

Paper Structure

This paper contains 14 sections, 10 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: (a) Video-to-4D results of SC4D with the corresponding control point visualizations, and (b) examples of our motion transfer application.
  • Figure 2: Overall pipeline of the proposed SC4D. In the coarse stage, SC4D learns a proper shape and motion initialization with a set of sparse control Gaussians. Then, in the fine stage, we propose Adaptive Gaussian (AG) initialization and Gaussian Alignment (GA) loss to prevent shape and motion degeneration, and jointly optimize the control points, dense Gaussians, and deformation MLP for the final results.
  • Figure 3: Illustration of Adaptive Gaussian (AG) initialization. $s$ is the scaling parameter of control Gaussians in the coarse stage. $Ori~init$ represents randomly initializing all the dense Gaussians within a sphere in the canonical space.
  • Figure 4: Illustration of the pipeline for our motion transfer application.
  • Figure 5: Qualitative Comparisons. We compare our method with Consistent4D and 4DGen. For each instance, we render two viewpoints at two timesteps. We also visualize the sparse control points to show their correspondence with dense Gaussians.
  • ...and 3 more figures
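
The Figure 3 caption contrasts "Ori init" (all dense Gaussians placed randomly within a sphere in canonical space) with Adaptive Gaussian initialization, which involves the scaling parameter $s$ of the coarse-stage control Gaussians. The sketch below illustrates that contrast under stated assumptions: `ori_init` follows the caption's description, while `adaptive_init` is only one plausible reading, in which dense centers are sampled around each control Gaussian with a spread proportional to its scale. The function names, the isotropic per-control scale, and the `per_ctrl` count are hypothetical and not taken from the paper.

```python
import numpy as np

def ori_init(n_dense, radius=0.5, seed=0):
    """Baseline: sample dense Gaussian centers uniformly inside a sphere."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=(n_dense, 3))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    # Cube-root radius sampling gives uniform density inside the ball
    r = radius * rng.uniform(size=(n_dense, 1)) ** (1.0 / 3.0)
    return direction * r

def adaptive_init(ctrl_xyz, ctrl_scale, per_ctrl=8, seed=0):
    """Assumed AG-style init: sample dense centers around each control Gaussian.

    ctrl_xyz:   (M, 3) control Gaussian centers from the coarse stage
    ctrl_scale: (M,)   isotropic scale s per control Gaussian (an assumption)
    """
    rng = np.random.default_rng(seed)
    m = ctrl_xyz.shape[0]
    # Offsets drawn from a normal distribution whose spread follows s
    offsets = rng.normal(size=(m, per_ctrl, 3)) * ctrl_scale[:, None, None]
    return (ctrl_xyz[:, None, :] + offsets).reshape(m * per_ctrl, 3)

# Usage: dense centers start near the coarse shape instead of in a generic ball
ctrl = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
dense = adaptive_init(ctrl, ctrl_scale=np.array([0.05, 0.2]), per_ctrl=4)
```

In this reading, the fine stage starts from dense Gaussians that already follow the coarse shape, which is consistent with the stated goal of preventing shape degeneration during refinement.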