DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance

Zixuan Wang, Jia Jia, Shikun Sun, Haozhe Wu, Rong Han, Zhenyu Li, Di Tang, Jiaqing Zhou, Jiebo Luo

TL;DR

This work presents DCM, a new multi-modal 3D dataset, which for the first time combines camera movement with dance motion and music audio, and proposes DanceCamera3D, a transformer-based diffusion model that incorporates a novel body attention loss and a condition separation strategy.

Abstract

Choreographers determine what the dances look like, while cameramen determine the final presentation of dances. Recently, various methods and datasets have showcased the feasibility of dance synthesis. However, camera movement synthesis with music and dance remains an unsolved and challenging problem due to the scarcity of paired data. Thus, we present DCM, a new multi-modal 3D dataset, which for the first time combines camera movement with dance motion and music audio. This dataset encompasses 108 dance sequences (3.2 hours) of paired dance-camera-music data from the anime community, covering 4 music genres. With this dataset, we uncover that dance camera movement is multifaceted and human-centric and has multiple influencing factors, making dance camera synthesis a more challenging task than camera or dance synthesis alone. To overcome these difficulties, we propose DanceCamera3D, a transformer-based diffusion model that incorporates a novel body attention loss and a condition separation strategy. For evaluation, we devise new metrics measuring camera movement quality, diversity, and dancer fidelity. Using these metrics, we conduct extensive experiments on our DCM dataset, providing both quantitative and qualitative evidence of the effectiveness of our DanceCamera3D model. Code and video demos are available at https://github.com/Carmenw1203/DanceCamera3D-Official.

Paper Structure (26 sections, 11 equations, 10 figures, 3 tables)

This paper contains 26 sections, 11 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: We present the DCM dataset, which contains 3.2 hours of paired 3D Dance motion, Camera movement, and Music audio.
  • Figure 2: Camera pose formats in our DCM dataset. (a) shows the original MMD format of camera pose including the position of RP, rotation and distance relative to RP, and Fov. (b) illustrates our Camera-Centric format consisting of the camera's Fov, global position, and rotation represented with x, y, and z vectors in the above figure.
  • Figure 3: Detailed distributions of our DCM dataset and split sets.
  • Figure 4: Overview of the DanceCamera3D framework. We adopt a transformer-based diffusion architecture to synthesize dance camera movement given music audio and dance pose as conditions. DanceCamera3D takes the above conditions and a noisy sequence $\boldsymbol{z}_{T} \sim \mathcal{N}(0,\boldsymbol{I})$ as input and predicts a noiseless sample $\hat{\boldsymbol{x}}$. We then diffuse $\hat{\boldsymbol{x}}$ back and repeat the denoising process until $t=0$ to obtain the final result.
  • Figure 5: Illustration of the training process and losses. For each randomly sampled timestep $t$, we diffuse the ground-truth sequence into a noisy sequence. DanceCamera3D then takes the conditions, the timestep, and the noisy sequence to predict camera movements $\hat{\boldsymbol{x}}$. We propose to detect joint masks indicating the joints inside the camera view and devise the body attention loss $\mathcal{L}_{ba}$ based on these joint masks, which are shown as dots on the joints.
  • ...and 5 more figures
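
The sampling procedure described in the Figure 4 caption (start from Gaussian noise, predict a noiseless sample, diffuse it back, repeat until $t=0$) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear beta schedule, the function names `make_alpha_bar`, `diffuse`, and `sample`, the `denoiser` call signature, and the number of steps are all assumptions introduced here, and the real model is a conditioned transformer rather than the placeholder callable used below.

```python
import numpy as np


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal rate for a linear beta schedule (assumed, not from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def diffuse(x0, t, alpha_bar, rng):
    """Forward-diffuse a clean sequence x0 to noise level t (1-indexed)."""
    a = alpha_bar[t - 1]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise


def sample(denoiser, music, dance, shape, num_steps=50, seed=0):
    """Iteratively denoise from z_T ~ N(0, I) down to t = 0.

    `denoiser(z, t, music, dance)` is a hypothetical stand-in for the model:
    it must return a prediction of the noiseless sequence with shape `shape`.
    """
    rng = np.random.default_rng(seed)
    alpha_bar = make_alpha_bar(num_steps)
    z = rng.standard_normal(shape)  # z_T ~ N(0, I)
    x_hat = z
    for t in range(num_steps, 0, -1):
        # Predict the noiseless sample from the current noisy sequence.
        x_hat = denoiser(z, t, music, dance)
        if t > 1:
            # Diffuse the prediction back to the next, lower noise level.
            z = diffuse(x_hat, t - 1, alpha_bar, rng)
    return x_hat


if __name__ == "__main__":
    # Toy denoiser that ignores its conditions and shrinks the input (illustration only).
    toy = lambda z, t, music, dance: 0.5 * z
    cameras = sample(toy, music=None, dance=None, shape=(8, 4), num_steps=10)
    print(cameras.shape)
```

Predicting the clean sequence at every step and re-noising it keeps the loop easy to read; the sequence shape `(8, 4)` in the usage lines is an arbitrary placeholder and does not reflect the camera representation used in DCM.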