PackDiT: Joint Human Motion and Text Generation via Mutual Prompting

Zhongyu Jiang, Wenhao Chai, Zhuoran Zhou, Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang

TL;DR

PackDiT is a diffusion-based, multi-task framework for joint human motion and text generation that couples two independent diffusion transformers, Motion DiT and Text DiT, through mutual prompting. The model supports text-to-motion, motion-to-text, motion prediction, text generation, and joint motion-text generation, and is trained in stages: unconditional pre-training, joint generation training, and task-specific fine-tuning. On HumanML3D, PackDiT achieves state-of-the-art text-to-motion performance with an FID of 0.106 and shows strong motion-to-text and motion in-between results, with diffusion-based motion-to-text performance comparable to autoregressive and LLM-based approaches. Mutual prompting, implemented as cross-attention between the modality-specific DiTs, together with modular training, makes the framework flexible and scalable, with potential applications in synthetic data generation and immersive multi-modal experiences.

Abstract

Human motion generation has advanced markedly with the advent of diffusion models. Most recent studies have concentrated on generating motion sequences based on text prompts, commonly referred to as text-to-motion generation. However, the bidirectional generation of motion and text, enabling tasks such as motion-to-text alongside text-to-motion, has been largely unexplored. This capability is essential for aligning diverse modalities and supports unconditional generation. In this paper, we introduce PackDiT, the first diffusion-based generative model capable of performing various tasks simultaneously, including motion generation, motion prediction, text generation, text-to-motion, motion-to-text, and joint motion-text generation. Our core innovation leverages mutual blocks to integrate multiple diffusion transformers (DiTs) across different modalities seamlessly. We train PackDiT on the HumanML3D dataset, achieving state-of-the-art text-to-motion performance with an FID score of 0.106, along with superior results in motion prediction and in-between tasks. Our experiments further demonstrate that diffusion models are effective for motion-to-text generation, achieving performance comparable to that of autoregressive models.
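The idea of coupling two modality-specific DiTs through cross-attention that can be switched on or off per task can be illustrated with a small sketch. This is not the authors' implementation: the function names (`attention`, `mutual_block`), the `TASK_GATES` mapping, and the use of projection-free single-head attention are all assumptions chosen to keep the example minimal. In particular, which cross-attention direction is enabled for which task is a plausible reading of the description above, not something the source states explicitly.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, context):
    # Single-head scaled dot-product attention with no learned projections
    # (a simplification; real DiT blocks use learned Q/K/V projections).
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context


def mutual_block(motion, text, motion_reads_text, text_reads_motion):
    # Each stream first attends to itself (residual self-attention).
    motion_out = motion + attention(motion, motion)
    text_out = text + attention(text, text)
    # Optional cross-attention: each stream may be "prompted" by the other.
    if motion_reads_text:
        motion_out = motion_out + attention(motion_out, text)
    if text_reads_motion:
        text_out = text_out + attention(text_out, motion)
    return motion_out, text_out


# Hypothetical task-to-gate mapping: (motion_reads_text, text_reads_motion).
# This mapping is an assumption made for illustration only.
TASK_GATES = {
    "unconditional": (False, False),
    "text_to_motion": (True, False),
    "motion_to_text": (False, True),
    "joint": (True, True),
}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    motion_tokens = rng.normal(size=(6, 8))  # 6 motion tokens, width 8
    text_tokens = rng.normal(size=(4, 8))    # 4 text tokens, width 8
    for task, gates in TASK_GATES.items():
        m, t = mutual_block(motion_tokens, text_tokens, *gates)
        print(task, m.shape, t.shape)
```

With both gates off, each stream behaves as an independent model, which is what allows separate unconditional pre-training; turning on one direction lets one modality condition the other without changing the token shapes of either stream.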

Paper Structure

This paper contains 26 sections, 4 equations, 7 figures, 5 tables, 4 algorithms.

Figures (7)

  • Figure 1: Pseudo-code for the training stages of PackDiT, which differ by task, e.g., unconditional pre-training, Text-to-Motion, and Motion-to-Text.
  • Figure 2: The architecture of PackDiT, where there are two independent DiTs for Motion and Text generation. By enabling and disabling the cross-attention layers in-between, PackDiT can solve almost all motion and text-related generation tasks, including text-to-motion, motion-to-text, motion prediction, motion in-between, random motion and text generation, and joint motion-text generation.
  • Figure 3: Training stages of the PackDiT model, illustrating the various phases, including a) unconditional pre-training, b) joint generation training, and c) task fine-tuning.
  • Figure 4: Visualization results of Text-to-Motion Generation via PackDiT.
  • Figure A1: More Motion in-Between visualization results of PackDiT. The orange avatars are from the ground truth motion, while the blue ones are generated by PackDiT.
  • ...and 2 more figures