Global Position Aware Group Choreography using Large Language Model

Haozhou Pang, Tianwei Ding, Lanshan He, Qi Gan

TL;DR

The paper tackles music-conditioned multi-person dance generation by casting group choreography as a sequence-to-sequence task processed by a fine-tuned Large Language Model. It introduces MotionRVQ to tokenize motion and uses Encodec for audio tokens, enabling a two-phase cross-modal pretraining and supervised fine-tuning regime. A key contribution is global position-based prompting, where Hilbert-curve position tokens guide coordination and long-sequence inference to maintain group formations. Experiments on the AIOZ-GDance dataset show state-of-the-art performance on group metrics and strong qualitative results, with ablations confirming the value of pretraining and positional guidance for reducing inter-dancer collisions and improving formation preservation.

Abstract

Dance serves as a profound and universal expression of human culture, conveying emotions and stories through movements synchronized with music. Although some current works have achieved satisfactory results in the task of single-person dance generation, the field of multi-person dance generation remains relatively novel. In this work, we present a group choreography framework that leverages recent advancements in Large Language Models (LLMs) by modeling the group dance generation problem as a sequence-to-sequence translation task. Our framework consists of a tokenizer that transforms continuous features into discrete tokens, and an LLM that is fine-tuned to predict motion tokens given the audio tokens. We show that by proper tokenization of input modalities and careful design of the LLM training strategies, our framework can generate realistic and diverse group dances while maintaining strong music correlation and dancer-wise consistency. Extensive experiments and evaluations demonstrate that our framework achieves state-of-the-art performance.

Paper Structure

This paper contains 16 sections, 5 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Framework overview. Our method consists of data tokenization and LLM processing. We convert motion, global root positions, and audio into discrete tokens. We then design the prompts and perform LLM pretraining and fine-tuning.
  • Figure 2: Visualization of different methods. Bailando tends to generate dances with cross-body intersections. Lodge generates dances with constrained root movements, resulting in less diverse group formations. Our method, empowered by global position guidance, enables more diverse formation patterns while significantly reducing character collision probabilities. For additional visual comparisons, please refer to the supplementary videos.
  • Figure 3: SFT with/without Global Position Guidance. In the prompt without global position guidance, a token <$n_c$> following the audio tokens indicates the total number of characters, and an ID token <$c_i$> precedes each character's motion tokens.
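
The TL;DR mentions Hilbert-curve position tokens for global position guidance. The following is a minimal sketch of how a 2D root position could be turned into such a token, not the authors' implementation. It uses the standard iterative Hilbert-curve index computation; the grid resolution, the stage bounds `lo`/`hi`, and the `<pos_k>` token format are all assumptions made for illustration.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Map cell (x, y) on an n x n grid (n a power of two) to its
    distance along the Hilbert curve, in the range [0, n*n)."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    if not (0 <= x < n and 0 <= y < n):
        raise ValueError("cell is outside the grid")
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def position_token(x: float, z: float, lo: float = -4.0, hi: float = 4.0,
                   grid: int = 16) -> str:
    """Quantize a ground-plane root position into a hypothetical token.

    lo/hi (stage bounds) and grid are illustrative values, not taken
    from the paper. Positions outside the bounds are clamped.
    """
    def to_cell(v: float) -> int:
        t = (v - lo) / (hi - lo)
        return min(grid - 1, max(0, int(t * grid)))

    return f"<pos_{hilbert_index(grid, to_cell(x), to_cell(z))}>"


if __name__ == "__main__":
    # Two nearby dancers and one far away; token format is hypothetical.
    for pos in [(0.1, 0.1), (0.2, 0.6), (-3.5, 3.5)]:
        print(pos, position_token(*pos))
```

A plausible reason to prefer a Hilbert ordering over row-major flattening is locality: cells that are consecutive along the curve are always spatial neighbors, so consecutive token ids refer to adjacent positions on the stage.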