Automatic Dance Video Segmentation for Understanding Choreography
Koki Endo, Shuhei Tsuchida, Tsukasa Fukusato, Takeo Igarashi
TL;DR
The paper tackles automatic segmentation of dance videos into short movements to aid practice. It proposes a multi-modal pipeline that fuses visual bone-vector features with audio Mel-spectrograms and processes them with a non-causal Temporal Convolutional Network to output frame-level segmentation probabilities, with peaks indicating segment boundaries. A new dataset of 1410 annotated dance videos from the AIST Dance Video Database is built, and ground-truth segmentation is represented as Gaussian-summed probabilities to capture annotator variability; ablation studies demonstrate the value of combining visual and audio features. An application illustrates how the segmentation can support practice by enabling looped playback of segments and adjustable peak-picking settings, highlighting practical impact for learning choreography across genres.
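The Gaussian-summed ground truth described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gaussian_sum_labels`, the width `sigma`, and the choice to normalize by the number of annotators and clip to [0, 1] are all assumptions made for illustration.

```python
import numpy as np

def gaussian_sum_labels(num_frames, annotations, sigma=3.0):
    """Build a frame-level target from several annotators' boundary frames.

    annotations: list of lists of frame indices, one inner list per annotator.
    sigma: Gaussian width in frames (hypothetical value, not from the paper).
    """
    t = np.arange(num_frames, dtype=float)
    target = np.zeros(num_frames)
    for frames in annotations:
        for f in frames:
            # Each annotated boundary contributes a Gaussian bump centered on it.
            target += np.exp(-0.5 * ((t - f) / sigma) ** 2)
    # Assumed normalization: average over annotators, then clip to [0, 1].
    target /= max(len(annotations), 1)
    return np.clip(target, 0.0, 1.0)

# Two annotators agree exactly at frame 60 and differ by one frame near 20.
labels = gaussian_sum_labels(100, [[20, 60], [21, 60]])
```

Under these assumptions, frames where annotators agree exactly reach the highest target value, while nearby disagreements produce a slightly lower, broader bump.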
Abstract
Segmenting a dance video into short movements is a popular way to make choreography easier to understand. However, segmentation is currently done manually and requires significant effort from experts. As a result, even though many dance videos are available on social media (e.g., TikTok and YouTube), it remains difficult for people, especially novices, to casually watch short video segments to practice dance choreography. In this paper, we propose a method to automatically segment a dance video into individual movements. Given a dance video as input, we first extract visual and audio features: the former are computed from the keypoints of the dancer in the video, and the latter are computed from the Mel spectrogram of the music in the video. Next, these features are passed to a Temporal Convolutional Network (TCN), and segmentation points are estimated by picking peaks of the network output. To build our training dataset, we annotate segmentation points on dance videos in the AIST Dance Video Database, which is a shared database containing original street dance videos with copyright-cleared dance music. The evaluation study shows that the proposed method (i.e., combining the visual and audio features) can estimate segmentation points with high accuracy. In addition, we developed an application that uses the proposed method to help dancers practice choreography.
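The final step, picking peaks of the network output to obtain segmentation points, might look roughly like the sketch below. The function name `pick_peaks` and the parameters `threshold` and `min_distance` are hypothetical; the abstract does not specify the actual peak-picking rule, so this is one plausible variant rather than the authors' procedure.

```python
import numpy as np

def pick_peaks(probs, threshold=0.5, min_distance=5):
    """Return frame indices of local maxima in a frame-level probability curve.

    threshold: minimum probability for a peak (assumed setting).
    min_distance: minimum spacing between kept peaks, in frames (assumed setting).
    """
    probs = np.asarray(probs, dtype=float)
    # Candidate peaks: interior frames above the threshold that are
    # higher than the left neighbor and not lower than the right neighbor.
    candidates = [
        i for i in range(1, len(probs) - 1)
        if probs[i] >= threshold
        and probs[i] > probs[i - 1]
        and probs[i] >= probs[i + 1]
    ]
    # Greedily keep the strongest peaks, dropping any that fall too close
    # to a peak that has already been kept.
    kept = []
    for i in sorted(candidates, key=lambda i: -probs[i]):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)

# Toy network output: two nearby peaks, one isolated peak, one weak peak.
probs = np.zeros(50)
probs[10], probs[12], probs[30], probs[40] = 0.9, 0.7, 0.6, 0.3
boundaries = pick_peaks(probs)  # [10, 30]
```

In this toy input, the peak at frame 12 is suppressed because it lies within `min_distance` of the stronger peak at frame 10, and the peak at frame 40 falls below `threshold`. Raising or lowering these two values trades off finer against coarser segmentation.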
