UniMuMo: Unified Text, Music and Motion Generation

Han Yang, Kun Su, Yutong Zhang, Jiaben Chen, Kaizhi Qian, Gaowen Liu, Chuang Gan

TL;DR

UniMuMo tackles the challenge of unified generation across text, music, and motion by aligning unpaired music and motion data through rhythmic patterns and encoding them into a shared token space. It introduces a three-stage pipeline: joint music-motion tokenization using a shared codebook, music-motion generation from text with a parallel generation scheme, and music-motion captioning via a fine-tuned T5 decoder. The framework leverages pre-trained single-modality models to reduce compute while introducing novel architectural elements (joint codebook, MoE for motion, parallel streams, and full-attention captioning) to enable cross-modal generation and zero-shot capabilities. Empirical results demonstrate competitive performance across multiple unidirectional benchmarks and tasks, supported by ablations and alignment analyses that underline the importance of the joint tokenization and parallel generation design.

Abstract

We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with a single training task of music-motion joint generation. Moreover, the model is designed by fine-tuning existing pre-trained single-modality models, significantly reducing computational demands. Extensive experiments demonstrate that UniMuMo achieves competitive results on all unidirectional generation benchmarks across music, motion, and text modalities. Quantitative results are available in the project page: https://hanyangclarence.github.io/unimumo_demo/
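
The idea of encoding motion with a music codebook can be illustrated with a small residual vector quantization sketch. This is only a toy under stated assumptions, not the authors' implementation: the function name `residual_quantize`, the codebook sizes, and the random arrays standing in for the frozen music codebooks and the motion encoder output are all hypothetical. It shows the one property the abstract relies on: motion latents are mapped to indices of codebooks that are never updated, so motion and music tokens share the same vocabulary.

```python
import numpy as np

def residual_quantize(latents, codebooks):
    """Quantize latents (T, D) against a list of frozen codebooks, each (K, D).

    Returns token indices of shape (num_levels, T) and the quantized
    reconstruction of shape (T, D). The codebooks are only read, never modified.
    """
    residual = latents.astype(np.float64)  # astype returns a copy
    quantized = np.zeros_like(residual)
    tokens = []
    for codebook in codebooks:
        # Squared Euclidean distance from every frame to every code: (T, K)
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        chosen = codebook[idx]
        tokens.append(idx)
        quantized += chosen
        # The next level quantizes whatever this level failed to explain.
        residual = residual - chosen
    return np.stack(tokens), quantized

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 3 residual levels, 16 codes each, 4-dim latents.
music_codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]
motion_latents = rng.normal(size=(10, 4))  # pretend motion encoder output

motion_tokens, motion_quantized = residual_quantize(motion_latents, music_codebooks)
```

In a trained system, only the motion encoder and decoder would be learned around the fixed codebooks; the sketch omits training entirely.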

Paper Structure (24 sections, 5 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: UniMuMo is able to perform generation tasks on any combination of music, motion, and text. The tasks shown in the figure include text-to-aligned-music-motion, music-to-motion, motion-to-music, music-captioning, and motion-captioning.
  • Figure 2: Overview: The training of UniMuMo consists of three stages: In stage 1, we train a motion RVQ-VAE using the frozen codebook from a pre-trained music RVQ-VAE to encode motion into the same space as music. In stage 2, we fine-tune a pre-trained music transformer decoder model on the text-to-music-motion task using the music-motion parallel generation scheme. In stage 3, we fine-tune a T5 decoder for music-motion captioning using the previous music-motion decoder as a feature extractor.
  • Figure 3: Illustrations of the technical details in our training process.
  • Figure 4: Illustrations of the technical details in the inference process.
  • Figure 5: A screenshot of the user study form for evaluating our music-motion alignment algorithm.
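
The music-motion alignment algorithm mentioned in Figure 5 is not specified in this section. As a rough illustration of what aligning motion to music "based on rhythmic patterns" can mean, the sketch below warps a motion clip in time so that its beats land on the music beats. It is a swapped-in simple technique (piecewise-linear time warping with `numpy.interp`), not the paper's method. The function name `warp_motion_to_music`, the assumption that beats are already detected and matched one-to-one, and the assumption that all beat times lie strictly inside the clip are all hypothetical.

```python
import numpy as np

def warp_motion_to_music(motion, motion_beats, music_beats, fps):
    """Resample motion (T, D) so that its beats coincide with the music beats.

    motion_beats and music_beats are equal-length, strictly increasing beat
    times in seconds, assumed to lie strictly inside the clip duration.
    """
    motion_beats = np.asarray(motion_beats, dtype=float)
    music_beats = np.asarray(music_beats, dtype=float)
    if motion_beats.shape != music_beats.shape:
        raise ValueError("beat lists must be matched one-to-one")
    num_frames, num_dims = motion.shape
    frame_times = np.arange(num_frames) / fps
    duration = frame_times[-1]
    # Pin the clip start and end so the warp covers the whole timeline.
    src = np.concatenate(([0.0], motion_beats, [duration]))
    dst = np.concatenate(([0.0], music_beats, [duration]))
    # For each output time on the music timeline, find the motion time to read.
    src_times = np.interp(frame_times, dst, src)
    # Linearly interpolate every feature dimension at the warped times.
    columns = [np.interp(src_times, frame_times, motion[:, d]) for d in range(num_dims)]
    return np.stack(columns, axis=1)

# Toy clip: 3 seconds at 10 fps whose single feature equals its own timestamp,
# so each output value reveals which source time was sampled.
fps = 10
clock = (np.arange(31) / fps)[:, None]
warped = warp_motion_to_music(clock, motion_beats=[1.0], music_beats=[1.5], fps=fps)
```

In the toy clip, the frame at 1.5 s of the warped output reads the motion at 1.0 s, which is where the motion beat originally occurred. A real pipeline would also need beat detection for both modalities and a rule for matching beats, neither of which is covered here.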