Learning Quantised Structure-Preserving Motion Representations for Dance Fingerprinting

Arina Kharlamova, Bowei He, Chen Ma, Xue Liu

Abstract

We present DANCEMATCH, an end-to-end framework for motion-based dance retrieval, the task of identifying semantically similar choreographies directly from raw video, which we define as DANCE FINGERPRINTING. While existing motion analysis and retrieval methods can compare pose sequences, they rely on continuous embeddings that are difficult to index, interpret, or scale. In contrast, DANCEMATCH constructs compact, discrete motion signatures that capture the spatio-temporal structure of dance while enabling efficient large-scale retrieval. Our system integrates Skeleton Motion Quantisation (SMQ) with Spatio-Temporal Transformers (STT) to encode human poses, extracted via Apple CoMotion, into a structured motion vocabulary. We further design the DANCE RETRIEVAL ENGINE (DRE), which performs sub-linear retrieval using a histogram-based index followed by re-ranking for refined matching. To facilitate reproducible research, we release DANCETYPESBENCHMARK, a pose-aligned dataset annotated with quantised motion tokens. Experiments demonstrate robust retrieval across diverse dance styles and strong generalisation to unseen choreographies, establishing a foundation for scalable motion fingerprinting and quantitative choreographic analysis.
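
To make the pipeline concrete, the following is a minimal sketch of the fingerprinting path described above: pose extraction, STT encoding, and SMQ token assignment. Every name, shape, and the linear "encoder" here is a hypothetical stand-in, since this page does not show the paper's actual interfaces; in the paper, poses come from Apple CoMotion and the encoder is a spatio-temporal transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_poses(video_path: str) -> np.ndarray:
    # Stand-in for the paper's pose-extraction step (Apple CoMotion):
    # returns synthetic (T, J, 3) joint trajectories for demonstration.
    return rng.normal(size=(120, 17, 3))

def encode(poses: np.ndarray, proj: np.ndarray) -> np.ndarray:
    # Stand-in for the STT encoder: a fixed linear map from each joint's
    # coordinates to a D-dimensional latent (the real encoder is a
    # spatio-temporal transformer producing per-joint latent patches).
    T, J, C = poses.shape
    return poses.reshape(T * J, C) @ proj          # (T*J, D)

def quantise(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    # SMQ step: nearest-neighbour assignment of each latent to a codeword,
    # yielding the discrete motion tokens that form the fingerprint.
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

# Toy run with D = 8 latent dimensions and K = 64 codewords.
proj = rng.normal(size=(3, 8))
codebook = rng.normal(size=(64, 8))
tokens = quantise(encode(extract_poses("clip.mp4"), proj), codebook)
print(tokens[:10])                                 # discrete motion tokens
```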

Paper Structure

This paper contains 20 sections, 26 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Illustration of the proposed end-to-end framework for motion-based dance retrieval. The system processes an input dance video into a structured motion representation and retrieves top-matching dances through alignment-based similarity.
  • Figure 2: Overview of the DanceMatch framework. The Spatio-Temporal Transformer (STT) encoder extracts per-joint latent embeddings from skeletal motion sequences. Latent patches are discretised via vector quantisation into motion tokens using a learnable codebook. The STT decoder reconstructs the motion sequence from the quantised representation. Immediate dead-code revival and EMA-based updates ensure stable codebook utilisation and prevent mode collapse. (A minimal quantiser sketch appears after this list.)
  • Figure 3: Two-stage DRE architecture: histogram-based indexing (Stage 1) provides a shortlist for re-ranking via multi-metric temporal alignment (Stage 2). Complexity per stage: $O(NK)$ vs. $O(L M_q \bar{M})$. (A two-stage retrieval sketch also appears below.)
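
The quantiser in Figure 2 combines a learnable codebook with EMA statistics and immediate dead-code revival. The sketch below is a minimal NumPy rendering of that mechanism; the decay value, the revival rule (resetting an unused code to a random latent from the current batch), and all names are assumptions rather than the paper's exact settings.

```python
import numpy as np

class EMACodebook:
    """VQ codebook with EMA updates and dead-code revival (cf. Figure 2)."""

    def __init__(self, num_codes: int, dim: int, decay: float = 0.99, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.codes = rng.normal(size=(num_codes, dim))
        self.decay = decay
        self.ema_count = np.ones(num_codes)           # per-code usage counts
        self.ema_sum = self.codes.copy()              # per-code latent sums

    def quantise(self, z: np.ndarray) -> np.ndarray:
        # Assign each latent row of z (N, D) to its nearest code index.
        d = ((z[:, None, :] - self.codes[None, :, :]) ** 2).sum(axis=-1)
        return d.argmin(axis=1)

    def update(self, z: np.ndarray, idx: np.ndarray) -> None:
        # EMA update of per-code statistics, then immediate dead-code
        # revival: any code unused in this batch is reset to a random
        # encoder output, keeping the whole vocabulary utilised.
        K, _ = self.codes.shape
        one_hot = np.eye(K)[idx]                      # (N, K) assignments
        counts = one_hot.sum(axis=0)                  # hits per code
        sums = one_hot.T @ z                          # latent mass per code
        self.ema_count = self.decay * self.ema_count + (1 - self.decay) * counts
        self.ema_sum = self.decay * self.ema_sum + (1 - self.decay) * sums
        self.codes = self.ema_sum / self.ema_count[:, None]
        dead = counts == 0
        if dead.any():                                # revive unused codes
            rng = np.random.default_rng()
            replacements = z[rng.integers(0, len(z), dead.sum())]
            self.codes[dead] = replacements
            self.ema_sum[dead] = replacements
            self.ema_count[dead] = 1.0
```

A typical training step would call `quantise` to obtain token assignments, compute the reconstruction loss through the STT decoder, and then call `update` with the same batch of latents.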
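
Figure 3's two-stage DRE can likewise be sketched. Stage 1 scores every database signature against the query histogram in $O(NK)$; Stage 2 re-ranks the shortlist of length $L$ with temporal alignment, matching the $O(L M_q \bar{M})$ cost in the caption. Histogram intersection and dynamic time warping are plausible instantiations chosen for illustration; the paper's actual similarity measures ("multi-metric temporal alignment") are not detailed on this page.

```python
import numpy as np

def signature(tokens: np.ndarray, K: int) -> np.ndarray:
    # Token histogram, L1-normalised: the Stage-1 index entry.
    h = np.bincount(tokens, minlength=K).astype(float)
    return h / max(h.sum(), 1.0)

def stage1_shortlist(query: np.ndarray, index: np.ndarray, top_l: int) -> np.ndarray:
    # Rank all N signatures by histogram intersection: O(N K) overall.
    scores = np.minimum(index, query[None, :]).sum(axis=1)
    return np.argsort(-scores)[:top_l]

def dtw_cost(a: np.ndarray, b: np.ndarray) -> float:
    # Dynamic-time-warping cost over token sequences with 0/1 matching;
    # one plausible instance of the alignment used in Stage 2.
    Ma, Mb = len(a), len(b)
    D = np.full((Ma + 1, Mb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ma + 1):
        for j in range(1, Mb + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Ma, Mb] / (Ma + Mb)

def retrieve(q_tokens, db_tokens, K, top_l=20, top_k=5):
    # Two-stage DRE: histogram shortlist, then alignment re-ranking over
    # the L shortlisted sequences (O(L M_q M̄) for lengths M_q and M̄).
    index = np.stack([signature(t, K) for t in db_tokens])
    shortlist = stage1_shortlist(signature(q_tokens, K), index, top_l)
    reranked = sorted(shortlist, key=lambda i: dtw_cost(q_tokens, db_tokens[i]))
    return reranked[:top_k]
```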