DanceCamAnimator: Keyframe-Based Controllable 3D Dance Camera Synthesis

Zixuan Wang, Jiayi Li, Xiaoyu Qin, Shikun Sun, Songtao Zhou, Jia Jia, Jiebo Luo

TL;DR

DanceCamAnimator is a novel end-to-end dance camera synthesis framework that imitates human animation procedures, offers powerful keyframe-based controllability with variable lengths, and surpasses previous baselines quantitatively and qualitatively.

Abstract

Synthesizing camera movements from music and dance is highly challenging due to the contradictory requirements and complexities of dance cinematography. Unlike human movements, which are always continuous, dance camera movements involve both continuous sequences of variable lengths and sudden drastic changes that simulate switching between multiple cameras. Previous works, however, treat every camera frame equally, which causes jittering and makes smoothing in post-processing unavoidable. To solve these problems, we propose to integrate animators' dance cinematography knowledge by formulating this task as a three-stage process: keyframe detection, keyframe synthesis, and tween function prediction. Following this formulation, we design a novel end-to-end dance camera synthesis framework, DanceCamAnimator, which imitates human animation procedures and shows powerful keyframe-based controllability with variable lengths. Extensive experiments on the DCM dataset demonstrate that our method surpasses previous baselines quantitatively and qualitatively. Code will be available at https://github.com/Carmenw1203/DanceCamAnimator-Official.

Paper Structure

This paper contains 16 sections, 7 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: Hierarchical dance-camera-making procedure by animators. According to the given music and dance, animators first select keyframes on the timeline. Next, animators set the camera parameters at each keyframe to capture the dance details or highlights. Then, for the non-keyframes between keyframes, animators produce the camera movements by editing tween curves that control the camera moving speed from one keyframe to the next. Finally, the 3D engine can render results with camera movements and dance.
  • Figure 2: Challenges in 3D dance camera synthesis. Dance camera movements are not entirely continuous because they consist of smooth complete shots and abrupt shot changes. Moreover, small disturbances can lead to big shakes of the dancer in the rendered video. These issues prevent neural networks from synthesizing satisfactory dance camera movements.
  • Figure 3: Overall framework of DanceCamAnimator. In the Camera Keyframe Detection stage, the model utilizes music-dance context and temporal keyframe history to generate subsequent temporal keyframe tags. Next, for each pair of adjacent keyframes, the Camera Keyframe Synthesis stage takes music-dance context and camera history as input to synthesize camera keyframe motions. Given camera keyframe motions, camera history, and music-dance context, the final stage predicts tween function values to calculate in-between non-keyframe camera movements. Encoders with the same name share structures in different stages but are trained separately. Stages 2 and 3 are trained together and conducted alternately during inference.
  • Figure 4: Visualization Comparison. We rendered the ground truth data and the results generated by our method and the baselines given a 2-second music-dance condition. Compared to the baselines, our DanceCamAnimator synthesizes dance camera movements with more shot changes in a short period of time. This comparison also shows that the use of filters in the baseline DanceCamera3D is unstable and carries the risk of erroneous smoothing, causing the character to deviate from the center of the camera view. This supports our choice of a framework that needs no post-processing.
  • Figure 5: Curves Comparison of Camera Parameters. Given the same music and dance input, we plot the camera curves of the ground truth and of the results synthesized by DanceCamera3D (Wang et al., CVPR 2024) and our DanceCamAnimator. Compared to DanceCamera3D, our method provides more stable movements during each complete shot. Meanwhile, our method better preserves the abrupt changes caused by shot switches. If we ablate the prediction of the tween function values and directly generate camera movements, the model fails to produce smooth shots. This demonstrates the efficacy of our design in predicting tween function values. Here, camera eye represents the position of the camera in the Cartesian coordinate system.
  • ...and 1 more figure
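
The captions for Figures 1 and 3 describe filling the non-keyframes between two keyframes with tween function values. The following is a minimal sketch of that idea, not the authors' implementation. It assumes each tween value is a blend weight in [0, 1] between the two surrounding keyframe camera vectors. The names `tween_interpolate`, `smoothstep_tween`, and `fill_between_keyframes` are hypothetical, and the smoothstep curve is a hand-written stand-in for the values the third stage would predict.

```python
from typing import Callable, List, Sequence

Camera = List[float]  # hypothetical flat vector of camera parameters


def tween_interpolate(start: Sequence[float], end: Sequence[float],
                      tween_values: Sequence[float]) -> List[Camera]:
    """Blend two keyframe cameras; each tween value is assumed to be the
    fraction of the way from `start` to `end` at one in-between frame."""
    return [[s + t * (e - s) for s, e in zip(start, end)] for t in tween_values]


def smoothstep_tween(n: int) -> List[float]:
    """Stand-in tween curve (ease-in/ease-out) sampled at n in-between frames.
    In the paper these values are predicted by a model, not hand-written."""
    values = []
    for i in range(1, n + 1):
        x = i / (n + 1)  # normalized time strictly between the keyframes
        values.append(x * x * (3 - 2 * x))
    return values


def fill_between_keyframes(keyframes: Sequence[Sequence[float]],
                           key_times: Sequence[int],
                           tween_fn: Callable[[int], List[float]] = smoothstep_tween
                           ) -> List[Camera]:
    """Expand keyframes at sorted frame indices `key_times` into a full
    per-frame camera sequence. Keyframes on adjacent frames have no
    in-between frames, so the jump between them stays abrupt."""
    frames: List[Camera] = []
    for i in range(len(keyframes) - 1):
        n_between = key_times[i + 1] - key_times[i] - 1
        frames.append(list(keyframes[i]))
        frames.extend(tween_interpolate(keyframes[i], keyframes[i + 1],
                                        tween_fn(n_between)))
    frames.append(list(keyframes[-1]))
    return frames


if __name__ == "__main__":
    # Two keyframes four frames apart: three in-between frames are generated.
    for frame in fill_between_keyframes([[0.0, 0.0, 0.0], [4.0, 8.0, 0.0]], [0, 4]):
        print(frame)
```

Because only the keyframes and the tween values are free parameters, editing a keyframe changes the whole shot consistently. This is one way to read the keyframe-based controllability the abstract describes.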