Unlocking Pretrained LLMs for Motion-Related Multimodal Generation: A Fine-Tuning Approach to Unify Diffusion and Next-Token Prediction

Shinichi Tanaka, Zhao Wang, Yoichi Kato, Jun Ohya

TL;DR

MoMug presents a unified framework that fine-tunes a pretrained LLM with LoRA to jointly support diffusion-based continuous motion generation and autoregressive text generation within a single model. By embedding text, diffusion timesteps, and motion frames into a shared sequence with special tokens, MoMug achieves strong text-to-motion and motion-to-text performance while maintaining low training costs. Empirical results on HumanML3D and KIT-ML demonstrate competitive gains in motion quality, alignment, and captioning accuracy, outperforming several diffusion- and LLM-based baselines. The work highlights a practical path toward high-quality, cost-efficient motion synthesis through cross-modal single-model learning and sets the stage for broader motion-related multimodal generation.

Abstract

In this paper, we propose a unified framework that leverages a single pretrained LLM for Motion-related Multimodal Generation, referred to as MoMug. MoMug integrates diffusion-based continuous motion generation with the model's inherent autoregressive discrete text prediction capabilities by fine-tuning a pretrained LLM. This enables seamless switching between continuous motion output and discrete text token prediction within a single model architecture, effectively combining the strengths of both diffusion- and LLM-based approaches. Experimental results show that, compared to the most recent LLM-based baseline, MoMug improves FID by 38% and mean accuracy across seven metrics by 16.61% on the text-to-motion task. Additionally, it improves mean accuracy across eight metrics by 8.44% on the motion-to-text task. To the best of our knowledge, this is the first approach to integrate diffusion- and LLM-based generation within a single model for motion-related multimodal tasks while maintaining low training costs. This establishes a foundation for future advancements in motion-related generation, paving the way for high-quality yet cost-efficient motion synthesis.

Paper Structure

This paper contains 27 sections, 9 equations, 4 figures, 11 tables, 2 algorithms.

Figures (4)

  • Figure 1: (a) and (b) illustrate the pros and cons of diffusion-based and LLM-based approaches, which motivate us to leverage a pretrained LLM to take advantage of both. As shown in (c), MoMug, the first work to unify both approaches in a single pretrained LLM, significantly improves performance on both text-to-motion and motion-to-text tasks across multiple metrics.
  • Figure 2: Overview of MoMug. The model consists of a pretrained LLM with LoRA fine-tuning and motion diffusion modeling, seamlessly switching between text-to-motion and motion-to-text modes. It processes mixed input sequences containing text tokens, the diffusion timestep $t$, and motion frames $\mathbf{x}^{\text{mot}}_{t,[1:N]}$. During training, in text-to-motion mode (a), the model optimizes both the language modeling loss $\mathcal{L}_{\text{LM}}$ and the diffusion modeling loss $\mathcal{L}_{\text{DDPM}}$, while in motion-to-text mode (c), $\mathcal{L}_{\text{DDPM}} = 0$. During inference, the model generates motion sequences via diffusion sampling (b) and text via LLM next-token prediction (d).
  • Figure : Unified Training
  • Figure : Sampling
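
The Figure 2 caption states that training uses both $\mathcal{L}_{\text{LM}}$ and $\mathcal{L}_{\text{DDPM}}$ in text-to-motion mode and sets $\mathcal{L}_{\text{DDPM}} = 0$ in motion-to-text mode. The following is a minimal sketch of that mode switch, not the authors' implementation. The function names, the `ddpm_weight` factor, the plain sum of the two losses, and the use of mean squared error against a generic diffusion target are all assumptions made for illustration; the source does not specify them here.

```python
import math


def lm_loss(token_log_probs):
    """Mean negative log-likelihood over the target text tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def ddpm_loss(predicted, target):
    """Mean squared error over all values in all motion frames.

    Whether the target is the added noise or the clean motion is not
    stated in the source, so it is left generic here (assumption).
    """
    count = sum(len(frame) for frame in target)
    squared = sum(
        (p - q) ** 2
        for pred_frame, tgt_frame in zip(predicted, target)
        for p, q in zip(pred_frame, tgt_frame)
    )
    return squared / count


def unified_loss(mode, token_log_probs, predicted=None, target=None,
                 ddpm_weight=1.0):
    """Combine the two losses according to the training mode.

    ddpm_weight is a hypothetical balancing factor, not taken from the paper.
    """
    loss_lm = lm_loss(token_log_probs)
    if mode == "text-to-motion":
        loss_ddpm = ddpm_loss(predicted, target)
    elif mode == "motion-to-text":
        loss_ddpm = 0.0  # no diffusion term in captioning mode
    else:
        raise ValueError(f"unknown mode: {mode}")
    return loss_lm + ddpm_weight * loss_ddpm


if __name__ == "__main__":
    log_probs = [math.log(0.5), math.log(0.5)]  # toy per-token log-probabilities
    pred = [[1.0, 2.0]]    # one toy motion frame with two features
    target = [[0.0, 2.0]]
    print(unified_loss("motion-to-text", log_probs))
    print(unified_loss("text-to-motion", log_probs, pred, target))
```

In this sketch the same function serves both modes, which mirrors the single-model design described in the caption: only the presence of the diffusion term changes between modes.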