FlexMotion: Lightweight, Physics-Aware, and Controllable Human Motion Generation

Arvin Tashakori, Arash Tashakori, Gongbo Yang, Z. Jane Wang, Peyman Servati

TL;DR

FlexMotion tackles the challenge of generating controllable, physically plausible human motion with high efficiency by learning in a latent space and avoiding external physics simulators. It integrates a physics-aware multimodal Transformer autoencoder with a latent diffusion model conditioned on text and a Spatial Controllability Module to steer motion via biomechanical signals such as muscle activations, joint torques, and contact forces. The approach achieves superior realism, physical plausibility, and controllability across HumanML3D, KIT-ML, and FLAG3D, aided by OpenSim-augmented data, while maintaining computational efficiency through latent-space diffusion. This work highlights the value of embedding biomechanical constraints directly into generative models to enable fine-grained control suitable for animation, robotics, and HCI, and points to future work on more complex dynamics and real-world data alignment.

Abstract

Lightweight, controllable, and physically plausible human motion synthesis is crucial for animation, virtual reality, robotics, and human-computer interaction applications. Existing methods often trade off computational efficiency, physical realism, and spatial controllability against one another. We propose FlexMotion, a novel framework that leverages a computationally lightweight diffusion model operating in the latent space, eliminating the need for physics simulators and enabling fast and efficient training. FlexMotion employs a multimodal pre-trained Transformer encoder-decoder that integrates joint locations, contact forces, joint actuations, and muscle activations to ensure the physical plausibility of the generated motions. FlexMotion also introduces a plug-and-play module that adds spatial controllability over a range of motion parameters (e.g., joint locations, joint actuations, contact forces, and muscle activations). Our framework achieves realistic motion generation with improved efficiency and control, setting a new benchmark for human motion synthesis. We evaluate FlexMotion on extended datasets and demonstrate its superior performance in terms of realism, physical plausibility, and controllability.

Paper Structure

This paper contains 23 sections, 20 equations, 3 figures, 10 tables, 2 algorithms.

Figures (3)

  • Figure 1: The proposed FlexMotion can generate physically plausible human motion sequences from a text prompt, with spatial control over diverse motion properties, including (a) contact forces, (b) joint locations, (c) muscle activation, and (d) joint actuation.
  • Figure 2: Overview of the proposed FlexMotion framework. It consists of, first, a multimodal autoencoder, which maps kinematic and dynamic motion properties to the latent space (Sec. 3.1); second, a latent-space motion diffusion model, which generates a motion sequence in the latent space conditioned on a text prompt (Sec. 3.2); and third, a spatial controllability module, which adds further control over the generated motion (Sec. 3.3).
  • Figure 3: Overview of the Physics-aware Multimodal Autoencoder. It maps diverse motion properties into the latent space and reconstructs them while enforcing physics-based loss terms (Sec. 3.1).
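
The three stages named in the Figure 2 caption can be pictured as a simple data flow. The Python sketch below is a toy illustration only, not the authors' implementation: the learned Transformer autoencoder, text encoder, and diffusion denoiser are replaced by trivial stand-ins, and every name and number in it (`MODALITIES`, `SIZES`, `text_condition`, `generate`, the 0.5 step size, the latent layout) is a hypothetical choice made for this example. It shows only the ordering: start from noise in a latent space, iteratively refine toward a text-derived target, optionally nudge selected entries toward a control signal, then decode into per-modality outputs.

```python
import random

# Hypothetical modality layout; the real model learns its latent space.
MODALITIES = ("joint_locations", "contact_forces",
              "joint_actuations", "muscle_activations")
SIZES = {"joint_locations": 3, "contact_forces": 2,
         "joint_actuations": 2, "muscle_activations": 2}
LATENT_DIM = sum(SIZES.values())


def decode(latent):
    """Stand-in decoder: split the latent vector into per-modality signals."""
    out, start = {}, 0
    for name in MODALITIES:
        out[name] = latent[start:start + SIZES[name]]
        start += SIZES[name]
    return out


def text_condition(prompt):
    """Stand-in text encoder: a deterministic pseudo-random vector per prompt."""
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]


def generate(prompt, control=None, steps=20, seed=0):
    """Toy latent-space generation loop with an optional control signal.

    control maps a latent index to a desired value (an assumption made
    for this sketch, not the paper's control interface).
    """
    rng = random.Random(seed)
    target = text_condition(prompt)
    z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # start from noise
    for _ in range(steps):
        # Stand-in denoiser: move each entry halfway toward the text target.
        z = [zi + 0.5 * (ti - zi) for zi, ti in zip(z, target)]
        # Stand-in control module: pull the selected entries toward the control.
        if control:
            for idx, value in control.items():
                z[idx] += 0.5 * (value - z[idx])
    return decode(z)


free = generate("a person walks forward")
steered = generate("a person walks forward", control={0: 5.0})
print(free["joint_locations"][0], steered["joint_locations"][0])
```

In this toy, the control term is applied inside the refinement loop rather than after it, so the steered entry settles between the text-derived target and the control value while all uncontrolled entries are unchanged.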