Controllable Motion Generation via Diffusion Modal Coupling

Luobin Wang, Hongzhan Yu, Chenning Yu, Sicun Gao, Henrik Christensen

TL;DR

This work tackles controllability in diffusion-based motion generation by introducing a Gaussian-mixture prior that couples each prior mode to a principal data mode, enabling direct, mode-level control during sampling without external guidance. By deriving modified forward and reverse diffusion processes and carefully parametrizing the priors to maintain mode separation, the approach achieves higher fidelity and controllability than post-hoc guidance baselines. Empirical results on Waymo and Maze2D demonstrate improved trajectory realism, feasibility, and per-task performance with a single unified model handling multiple modes, highlighting scalability and robustness. The framework eliminates train–test mismatch inherent in guidance methods and provides a principled path toward controllable, multi-modal robotic motion synthesis with practical impact for planning and forecasting.

Abstract

Diffusion models have recently gained significant attention in robotics due to their ability to generate multi-modal distributions of system states and behaviors. However, a key challenge remains: ensuring precise control over the generated outcomes without compromising realism. This is crucial for applications such as motion planning or trajectory forecasting, where adherence to physical constraints and task-specific objectives is essential. We propose a novel framework that enhances controllability in diffusion models by leveraging multi-modal prior distributions and enforcing strong modal coupling. This allows us to initiate the denoising process directly from distinct prior modes that correspond to different possible system behaviors, so that sampling stays aligned with the training distribution. We evaluate our approach on motion prediction using the Waymo dataset and multi-task control in Maze2D environments. Experimental results show that our framework outperforms both guidance-based techniques and conditioned models with unimodal priors, achieving superior fidelity, diversity, and controllability, even in the absence of explicit conditioning. Overall, our approach provides a more reliable and scalable solution for controllable motion generation in robotics.
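To make the idea of starting the denoising process from a chosen prior mode concrete, the following minimal sketch draws the initial sample $x_T$ from one component of a Gaussian-mixture prior. The component means, the shared standard deviation, and the name `sample_prior` are illustrative assumptions rather than the paper's parameterization, and the trained reverse process that would map $x_T$ to a trajectory is omitted.

```python
import random

# Hypothetical 2D prior means, one per behavior mode (illustrative values only).
PRIOR_MEANS = {
    "left": (-2.0, 0.0),
    "straight": (0.0, 2.0),
    "right": (2.0, 0.0),
}
PRIOR_STD = 0.5  # assumed shared isotropic standard deviation


def sample_prior(mode=None, rng=random):
    """Draw x_T from a Gaussian-mixture prior.

    If `mode` is given, sample from that component only (mode-level control);
    otherwise pick a component uniformly at random (uncontrolled sampling).
    """
    if mode is None:
        mode = rng.choice(sorted(PRIOR_MEANS))
    mean = PRIOR_MEANS[mode]
    x_T = tuple(m + PRIOR_STD * rng.gauss(0.0, 1.0) for m in mean)
    return mode, x_T


rng = random.Random(0)
mode, x_T = sample_prior("left", rng)  # start denoising from the "left" mode
print(mode, x_T)
```

Selecting the mode happens entirely at the prior, so no guidance term is added during the reverse steps; this is the property the abstract attributes to modal coupling.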

Paper Structure

This paper contains 15 sections, 2 theorems, 29 equations, 4 figures, 3 tables.

Key Result

Lemma 1

Let $\eta_{t} := 1 + \sum_{m=1}^{t-1} \sqrt{\prod_{n=m+1}^{t} \alpha_{n}}$, and consider the forward noising process [...] where $\epsilon_{t} \sim \mathcal{N}(0, I)$. Then, for any step $t$, [...]. Under the standard assumption that $\bar{\alpha}_{T} \to 0$ as $T$ grows large, it follows that $q(x_{T} \mid x_{0}) = \mathcal{N}(x_{T}; \mu, \sigma^{2} I)$.
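The role of $\eta_t$ can be checked numerically. The sketch below assumes a forward step of the form $x_t = \sqrt{\alpha_t}\, x_{t-1} + b + \sqrt{1 - \alpha_t}\, \sigma \epsilon_t$ with a constant shift $b$. This form is an assumption chosen for illustration, since the step itself is not reproduced in the statement above. Under it, unrolling the recursion accumulates a shift of $b\,\eta_t$, so choosing $b = \mu / \eta_T$ drives the mean of $x_T$ to $\mu$ and its variance to $\sigma^2$ as $\bar{\alpha}_T \to 0$. The linear schedule and the scalar values are also illustrative.

```python
import math


def eta(alphas, t):
    """eta_t = 1 + sum_{m=1}^{t-1} sqrt(prod_{n=m+1}^{t} alpha_n).

    alphas[n - 1] holds alpha_n, so alphas[m:t] is alpha_{m+1} .. alpha_t.
    """
    total = 1.0
    for m in range(1, t):
        total += math.sqrt(math.prod(alphas[m:t]))
    return total


def forward_moments(alphas, x0, mu, sigma):
    """Propagate the mean and variance of x_t through the assumed forward step."""
    T = len(alphas)
    shift = mu / eta(alphas, T)  # assumed constant per-step shift b
    mean, var = x0, 0.0
    for a in alphas:
        mean = math.sqrt(a) * mean + shift
        var = a * var + (1.0 - a) * sigma ** 2
    return mean, var


# Linear beta schedule (a common choice, assumed here).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]

mean_T, var_T = forward_moments(alphas, x0=3.0, mu=-1.5, sigma=0.7)
print(mean_T, var_T)  # close to mu and sigma**2 when alpha_bar_T is near 0
```

The recursion reproduces the closed form exactly: after $T$ steps the mean equals $\sqrt{\bar{\alpha}_T}\, x_0 + \mu$, and the dependence on $x_0$ vanishes only in the limit $\bar{\alpha}_T \to 0$.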

Figures (4)

  • Figure 1: High-level comparison of guidance-based approaches versus our proposed method. (a) A standard diffusion model fits a multi-modal data distribution (three modes). A guidance term (red) attempts to steer sampling toward a rare yet operationally critical mode (Mode II). (b) Standard (unguided) sampling concentrates on high-probability modes. (c) Guidance perturbs intermediate states off the well-trained data manifold, degrading fidelity. (d)(e) Our method couples each principal mode to a dedicated prior component, enabling direct, mode-aligned control at sampling while avoiding guidance-induced distribution mismatch.
  • Figure 2: High-level overview. Conventional diffusion models use a unimodal prior distribution and lack an intrinsic mechanism to select which trajectories are emphasized. We introduce a multi-modal prior and enforce strong modal coupling between prior and data via a novel diffusion process. The framework enables direct mode selection even with an unconditioned diffusion model, supporting precise and adaptive motion generation. In the figure, each prior component corresponds to one behavior. "ACC", "DEC" and "MSP" refer to speed modes (acceleration, deceleration, and maintaining speed), while "R", "L" and "S" represent steering modes (right, left, and straight).
  • Figure 3: 2D toy example. (a) The data distribution with four distinct modes. (b-c) Results from DDPM (Ho et al., 2020) show that using a unimodal prior yields spurious samples in the gaps between modes. (d-e) When prior means have large magnitude (i.e., lie far from the origin), the diffusion model struggles to recover realistic per-mode data distributions. (f-g) Insufficient separation between prior modes also prevents the model from accurately capturing the data distribution. (h-i) With a carefully designed prior parameterization that maintains clear separation between modes without introducing excessive values, our method produces substantially fewer spurious samples and further enables direct control over individual modes. Corresponding modes and samples share the same color.
  • Figure 4: Qualitative results for motion prediction. In standard diffusion, trajectories are sampled randomly from a unimodal prior, offering no inherent controllability. CG applies guidance to steer generation, but it relies heavily on the guidance influence factor, making it difficult to balance sample fidelity against controllability. CFG blends unconditional and conditional outputs for guidance, which limits controllability and fidelity when the target lies off the reference data manifold. Our method integrates modal coupling with a multi-modal prior distribution, yielding notable improvements in both sample fidelity and controllability.
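The trade-off described in the Figure 3 caption, between prior means that are too large and prior modes that are too close together, can be quantified with two simple numbers. The sketch below is a hypothetical diagnostic, not the paper's parameterization: it places `k` prior means on a circle (an assumed layout) and reports the smallest pairwise distance in units of the component standard deviation together with the largest mean norm.

```python
import math


def circle_means(k, radius):
    """Place k prior means evenly on a circle of the given radius (assumed layout)."""
    return [
        (radius * math.cos(2 * math.pi * i / k), radius * math.sin(2 * math.pi * i / k))
        for i in range(k)
    ]


def separation_ratio(means, sigma):
    """Smallest pairwise distance between prior means, in units of sigma."""
    closest = min(
        math.dist(a, b) for i, a in enumerate(means) for b in means[i + 1:]
    )
    return closest / sigma


def max_norm(means):
    """Largest distance of any prior mean from the origin."""
    return max(math.hypot(x, y) for x, y in means)


means = circle_means(4, radius=3.0)
print(separation_ratio(means, sigma=0.5), max_norm(means))
```

A larger radius raises the separation ratio but also raises the mean norm, which corresponds to the failure case in panels (d-e); a smaller radius or larger `sigma` lowers the ratio, which corresponds to panels (f-g).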

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2