Automatic Dance Video Segmentation for Understanding Choreography

Koki Endo, Shuhei Tsuchida, Tsukasa Fukusato, Takeo Igarashi

TL;DR

The paper tackles automatic segmentation of dance videos into short movements to aid practice. It proposes a multi-modal pipeline that fuses visual bone-vector features with audio Mel-spectrograms and processes them with a non-causal Temporal Convolutional Network to output frame-level segmentation probabilities, with peaks indicating segment boundaries. A new dataset of 1410 annotated dance videos from the AIST Dance Video Database is built, and ground-truth segmentation is represented as Gaussian-summed probabilities to capture annotator variability; ablation studies demonstrate the value of combining visual and audio features. An application illustrates how the segmentation can support practice by enabling looped playback of segments and adjustable peak-picking settings, highlighting practical impact for learning choreography across genres.
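The idea of representing ground truth as Gaussian-summed probabilities can be sketched as follows. This is only an illustration under stated assumptions: the function name `gaussian_sum_labels`, the per-annotator list format, the width `sigma`, and the normalization by annotator count are all hypothetical and not taken from the paper.

```python
import math


def gaussian_sum_labels(annotations, num_frames, sigma=2.0):
    """Build a frame-level soft label from several annotators' segmentation points.

    annotations: one list of frame indices per annotator (hypothetical format).
    sigma: Gaussian width in frames (assumed value, not from the paper).
    """
    labels = [0.0] * num_frames
    for points in annotations:
        for p in points:
            # Place one Gaussian bump centered on each annotated frame.
            for t in range(num_frames):
                labels[t] += math.exp(-((t - p) ** 2) / (2.0 * sigma ** 2))
    # Assumed normalization: average over annotators, then clip to [0, 1].
    n = max(len(annotations), 1)
    return [min(v / n, 1.0) for v in labels]


# Three annotators: two mark frame 10, one marks frame 12.
labels = gaussian_sum_labels([[10], [10], [12]], num_frames=20, sigma=1.0)
print(labels.index(max(labels)))
```

Frames where annotators agree receive a higher value, and nearby disagreement shows up as a broader, lower bump rather than as conflicting hard labels.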

Abstract

Segmenting dance video into short movements is a popular way to easily understand dance choreography. However, it is currently done manually and requires a significant amount of effort by experts. That is, even if many dance videos are available on social media (e.g., TikTok and YouTube), it remains difficult for people, especially novices, to casually watch short video segments to practice dance choreography. In this paper, we propose a method to automatically segment a dance video into each movement. Given a dance video as input, we first extract visual and audio features: the former is computed from the keypoints of the dancer in the video, and the latter is computed from the Mel spectrogram of the music in the video. Next, these features are passed to a Temporal Convolutional Network (TCN), and segmentation points are estimated by picking peaks of the network output. To build our training dataset, we annotate segmentation points to dance videos in the AIST Dance Video Database, which is a shared database containing original street dance videos with copyright-cleared dance music. The evaluation study shows that the proposed method (i.e., combining the visual and audio features) can estimate segmentation points with high accuracy. In addition, we developed an application to help dancers practice choreography using the proposed method.
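The last step of the pipeline, estimating segmentation points by picking peaks of the network output, might look roughly like the sketch below. The function `pick_peaks` and its parameters `threshold` and `min_distance` are assumptions made for illustration; the abstract does not specify the peak-picking rule.

```python
def pick_peaks(probs, threshold=0.5, min_distance=3):
    """Return frame indices of local maxima in a frame-level probability sequence.

    threshold and min_distance are hypothetical settings, not values from the paper.
    """
    # A candidate is a local maximum whose probability reaches the threshold.
    candidates = [
        i
        for i in range(1, len(probs) - 1)
        if probs[i] >= threshold
        and probs[i] > probs[i - 1]
        and probs[i] >= probs[i + 1]
    ]
    # Greedily keep the strongest peaks, dropping any that fall too close
    # to a peak that has already been kept.
    kept = []
    for i in sorted(candidates, key=lambda i: probs[i], reverse=True):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)


probs = [0.0, 0.1, 0.8, 0.2, 0.6, 0.1, 0.0, 0.3, 0.9, 0.4]
print(pick_peaks(probs, threshold=0.5, min_distance=3))  # [2, 8]
print(pick_peaks(probs, threshold=0.5, min_distance=1))  # [2, 4, 8]
```

Lowering `min_distance` or `threshold` yields more, shorter segments, which is one plausible way to expose an adjustable granularity to a user.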

Paper Structure (27 sections, 9 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Overview of the proposed method. First, we extract visual and audio features from an input dance video. Then, the system automatically estimates the segmentation probability score based on a Temporal Convolutional Network (TCN), and final segmentation points can be obtained by simply picking peaks of the probability.
  • Figure 2: Overview of the proposed network. The input $\bm{X}$ is passed to the TCN for each row and then the fully connected layer for each column.
  • Figure 3: User interface of our annotation tool. (a) Dance video. (b) Seek bar and segmentation candidates. (c) Playback and skip buttons. (d) Pull-down to change the playback mode. (e) Pull-down to change the playback speed. (f) Buttons to load a video and submit an annotation.
  • Figure 4: Segmentation points annotated by each participant. (a) Basic dance. (b) Advanced dance. Different marker shapes and colors represent different participants. A black line represents a segmentation proportion.
  • Figure 5: Video frames of the basic dance for the confirmation task. (a) At beat 12. (b) At beat 13.
  • ...and 6 more figures