MixerMDM: Learnable Composition of Human Motion Diffusion Models

Pablo Ruiz-Ponce, German Barquero, Cristina Palmero, Sergio Escalera, José García-Rodríguez

TL;DR

MixerMDM introduces a learnable model-composition framework for combining pre-trained text-conditioned human motion diffusion models. A Mixer module, implemented as a Transformer encoder followed by an MLP, predicts dynamic mixing weights $w_t$ at each denoising timestep $t$, enabling fine-grained, condition-aware blending of the two models' outputs. Training is adversarial, with one discriminator per pre-trained model, which guides the Mixer to preserve the characteristics of each model without requiring ground-truth blends. Evaluations on InterHuman and HumanML3D show improved interaction alignment, individual alignment, and adaptability. The design is modular, so pre-trained models can be swapped without retraining; however, the approach adds compute and may face training-stability challenges, which points to future work on representation harmonization and efficiency.
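As a rough illustration of what blending with mixing weights can look like, the sketch below combines two motions with a convex combination. The function name `mix_motions`, the per-frame granularity of the weights, and the convention that the weight multiplies the first model's output are all assumptions made for this example; the summary above does not specify the exact mixing formula.

```python
from typing import List

# A motion is a list of frames; each frame is a list of feature values.
Motion = List[List[float]]


def mix_motions(x_a: Motion, x_b: Motion, w: List[float]) -> Motion:
    """Blend two motions frame by frame (hypothetical helper).

    Assumption: each weight w[f] lies in [0, 1] and scales model A's
    frame, while (1 - w[f]) scales model B's frame. A single global
    weight is the special case where every entry of w is equal.
    """
    if not (len(x_a) == len(x_b) == len(w)):
        raise ValueError("motions and weights must have the same number of frames")

    mixed: Motion = []
    for frame_a, frame_b, w_f in zip(x_a, x_b, w):
        if not 0.0 <= w_f <= 1.0:
            raise ValueError("mixing weights must lie in [0, 1]")
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same number of features")
        # Convex combination of the two models' predictions for this frame.
        mixed.append([w_f * a + (1.0 - w_f) * b for a, b in zip(frame_a, frame_b)])
    return mixed


# Usage: two one-frame motions with two features each.
individual = [[1.0, 2.0]]
interaction = [[3.0, 4.0]]
blended = mix_motions(individual, interaction, [0.5])
```

In this sketch a weight of 1 returns the first model's motion unchanged and a weight of 0 returns the second model's, so the predicted weights interpolate between the two pre-trained models at every denoising step.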

Abstract

Generating human motion guided by conditions such as textual descriptions is challenging due to the need for datasets with pairs of high-quality motion and their corresponding conditions. The difficulty increases when aiming for finer control in the generation. To that end, prior works have proposed to combine several motion diffusion models pre-trained on datasets with different types of conditions, thus allowing control with multiple conditions. However, the proposed merging strategies overlook that the optimal way to combine the generation processes might depend on the particularities of each pre-trained generative model and also the specific textual descriptions. In this context, we introduce MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. Unlike previous approaches, MixerMDM provides a dynamic mixing strategy that is trained in an adversarial fashion to learn to combine the denoising process of each model depending on the set of conditions driving the generation. By using MixerMDM to combine single- and multi-person motion diffusion models, we achieve fine-grained control on the dynamics of every person individually, and also on the overall interaction. Furthermore, we propose a new evaluation technique that, for the first time in this task, measures the interaction and individual quality by computing the alignment between the mixed generated motions and their conditions as well as the capabilities of MixerMDM to adapt the mixing throughout the denoising process depending on the motions to mix.

Paper Structure

This paper contains 25 sections, 4 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: We introduce MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. MixerMDM has demonstrated a consistent ability to generate highly controllable human interactions by combining a model that generates individual motions from textual descriptions with a model that creates human-human interactions.
  • Figure 2: MixerMDM pipeline. At each timestep $t$ of the denoising process, a mixed motion is generated by first obtaining motions from separate text-conditioned pre-trained motion diffusion models. Using these motions and their conditions, the Mixer predicts unique mixing weights that are subsequently used in the Mixing procedure to blend the generated motions and obtain the mixed motion $x^m_t$.
  • Figure 3: Mixer architecture. The Mixer is composed of a Transformer encoder that takes as input both generated motions by the pre-trained models, their respective conditions, and the actual timestep of the denoising process. This encoder generates a latent representation, which is decoded by an MLP that outputs the mixing weights. $T$: number of frames of the motion sequence.
  • Figure 4: Adversarial training. Each pre-trained model has a specific discriminator that is trained with a hinge loss. We use the outputs of the pre-trained model as positive samples, and the mixed predictions generated by MixerMDM as negative samples.
  • Figure 5: Mean mixing weights. The mean mixing weights of the best models for each variation of the Mixer output. Previous model composition techniques appear in the Global plot with a dotted line (DiffusionBlending) and a dashed line (DualMDM [in2in]).
  • ...and 4 more figures
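
The adversarial setup in the Figure 4 caption can be sketched with the standard hinge formulation used for GAN training. This is a minimal sketch under stated assumptions: the function names are hypothetical, the discriminator scores are plain floats, and the Mixer objective shown here (the negated mean score, summed over discriminators) is the usual generator-side counterpart of the hinge loss rather than a formula confirmed by the text above.

```python
from typing import List


def discriminator_hinge_loss(real_scores: List[float], fake_scores: List[float]) -> float:
    """Hinge loss for one discriminator (hypothetical helper).

    real_scores: scores for the pre-trained model's outputs (positive samples).
    fake_scores: scores for the mixed predictions (negative samples).
    """
    # Penalize positive samples scored below +1.
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    # Penalize negative samples scored above -1.
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def mixer_adversarial_loss(fake_scores_per_discriminator: List[List[float]]) -> float:
    """Assumed Mixer objective: raise every discriminator's score on mixed motions."""
    return sum(-sum(scores) / len(scores) for scores in fake_scores_per_discriminator)


# Usage: one discriminator per pre-trained model, with made-up scores.
loss_d = discriminator_hinge_loss(real_scores=[2.0, 0.0], fake_scores=[-2.0, 0.0])
loss_mixer = mixer_adversarial_loss([[1.0, 3.0], [-2.0]])
```

Because each discriminator only sees its own pre-trained model's outputs as positives, lowering the Mixer loss pushes the mixed motion toward both models' output distributions at once, which matches the stated goal of training without ground-truth blends.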