DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance

Zixuan Wang; Jia Jia; Shikun Sun; Haozhe Wu; Rong Han; Zhenyu Li; Di Tang; Jiaqing Zhou; Jiebo Luo

TL;DR

This work presents DCM, a new multi-modal 3D dataset, which for the first time combines camera movement with dance motion and music audio, and proposes DanceCamera3D, a transformer-based diffusion model that incorporates a novel body attention loss and a condition separation strategy.

Abstract

Choreographers determine what the dances look like, while cameramen determine the final presentation of dances. Recently, various methods and datasets have showcased the feasibility of dance synthesis. However, camera movement synthesis with music and dance remains an unsolved challenging problem due to the scarcity of paired data. Thus, we present DCM, a new multi-modal 3D dataset, which for the first time combines camera movement with dance motion and music audio. This dataset encompasses 108 dance sequences (3.2 hours) of paired dance-camera-music data from the anime community, covering 4 music genres. With this dataset, we uncover that dance camera movement is multifaceted and human-centric, and possesses multiple influencing factors, making dance camera synthesis a more challenging task compared to camera or dance synthesis alone. To overcome these difficulties, we propose DanceCamera3D, a transformer-based diffusion model that incorporates a novel body attention loss and a condition separation strategy. For evaluation, we devise new metrics measuring camera movement quality, diversity, and dancer fidelity. Utilizing these metrics, we conduct extensive experiments on our DCM dataset, providing both quantitative and qualitative evidence showcasing the effectiveness of our DanceCamera3D model. Code and video demos are available at https://github.com/Carmenw1203/DanceCamera3D-Official.

Paper Structure (26 sections, 11 equations, 10 figures, 3 tables)

Introduction
Related Work
Dance and Camera Dataset
Dance Synthesis
Camera Control and Planning
The DCM Dataset
Dataset Collection and Preprocessing
Dataset Description
Dataset Split
Music & Dance Driven Camera Generation
Problem Formulation
DanceCamera3D Architecture
Training and Losses
Experiments
Experimental Setup
...and 11 more sections

Figures (10)

Figure 1: We present the DCM dataset, which contains 3.2 hours of paired 3D Dance motion, Camera movement, and Music audio.
Figure 2: Camera pose formats in our DCM dataset. (a) shows the original MMD format of camera pose including the position of RP, rotation and distance relative to RP, and Fov. (b) illustrates our Camera-Centric format consisting of the camera's Fov, global position, and rotation represented with x, y, and z vectors in the above figure.
Figure 3: Detailed distributions of our DCM dataset and split sets.
Figure 4: Overview of the DanceCamera3D framework. We adopt a transformer-based diffusion architecture to synthesize dance camera movement given music audio and dance pose as conditions. DanceCamera3D takes these conditions and a noisy sequence $\boldsymbol{z}_{T} \sim \mathcal{N}(0,\boldsymbol{I})$ as input and predicts a noiseless sample $\hat{\boldsymbol{x}}$. We then diffuse $\hat{\boldsymbol{x}}$ back and repeat the denoising process until $t=0$ to obtain the final result.
Figure 5: Illustration of the training process and losses. For each randomly sampled timestep $t$, we diffuse back the ground truth sequence to a noisy sequence. Then DanceCamera3D takes conditions, timestep, and a noisy sequence to predict camera movements $\hat{\boldsymbol{x}}$. We propose to detect joint masks indicating joints inside the camera view and devise the body attention loss $\mathcal{L}_{ba}$ based on joint masks which are represented with dots on the joints.
...and 5 more figures
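
The sampling loop described in the Figure 4 caption (start from $\boldsymbol{z}_{T} \sim \mathcal{N}(0,\boldsymbol{I})$, predict $\hat{\boldsymbol{x}}$, diffuse it back, and repeat until $t=0$) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the linear noise schedule, the number of steps, the array shapes, and the `toy_denoiser` stand-in for the transformer are all assumptions introduced here.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    # Assumed DDPM-style linear schedule; the paper's schedule may differ.
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)  # alpha_bar[i] belongs to timestep i + 1

def diffuse(x0, alpha_bar_t, rng):
    # Forward diffusion: add Gaussian noise to a clean sample.
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

def sample(denoiser, music, dance, shape, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    alpha_bar = make_alpha_bar(num_steps)
    z = rng.standard_normal(shape)  # z_T ~ N(0, I)
    x_hat = z
    for t in range(num_steps, 0, -1):
        # Predict a noiseless camera sequence from the current noisy one.
        x_hat = denoiser(z, t, music, dance)
        if t > 1:
            # Diffuse the prediction back to timestep t - 1.
            z = diffuse(x_hat, alpha_bar[t - 2], rng)
    return x_hat  # prediction made at the last step

def toy_denoiser(z, t, music, dance):
    # Hypothetical stand-in for the conditioned transformer:
    # ignores the noise and returns a constant derived from the dance input.
    return np.full_like(z, dance.mean())

music = np.zeros((4, 2))      # placeholder music features (frames x dims)
dance = np.ones((4, 5)) * 2.0  # placeholder dance poses (frames x dims)
camera = sample(toy_denoiser, music, dance, shape=(4, 3))
```

In this sketch the denoiser predicts the clean sequence directly at every step, and the final output is simply the last prediction; a real model would replace `toy_denoiser` with the trained network.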
