SuperPADL: Scaling Language-Directed Physics-Based Control with Progressive Supervised Distillation

Jordan Juravsky, Yunrong Guo, Sanja Fidler, Xue Bin Peng

TL;DR

SuperPADL addresses the scalability bottleneck in physics-based, language-directed character control by combining reinforcement learning (RL) with supervised distillation. The method first trains a large number of specialized RL experts, one per motion, then aggregates groups of them into group controllers trained with PADL+BC, and finally distills those into a single global text-conditioned policy that covers thousands of skills and runs in real time. Experiments show that this progressive distillation outperforms RL-based baselines at large data scales, supports natural transitions between skills, and keeps the controller responsive to interactive language commands. The result is a practical, scalable approach to language-driven physics-based animation that runs interactively on a consumer GPU.

Abstract

Physically simulated models for human motion can generate high-quality responsive character animations, often in real time. Natural language serves as a flexible interface for controlling these models, allowing expert and non-expert users to quickly create and edit their animations. Many recent physics-based animation methods, including those that use text interfaces, train control policies using reinforcement learning (RL). However, scaling these methods beyond several hundred motions has remained challenging. Meanwhile, kinematic animation models successfully learn from thousands of diverse motions by leveraging supervised learning methods. Inspired by these successes, in this work we introduce SuperPADL, a scalable framework for physics-based text-to-motion that leverages both RL and supervised learning to train controllers on thousands of diverse motion clips. SuperPADL is trained in stages using progressive distillation, starting with a large number of specialized experts trained using RL. These experts are then iteratively distilled into larger, more robust policies using a combination of reinforcement learning and supervised learning. Our final SuperPADL controller is trained on a dataset containing over 5000 skills and runs in real time on a consumer GPU. Moreover, our policy can naturally transition between skills, allowing users to interactively craft multi-stage animations. We experimentally demonstrate that SuperPADL significantly outperforms RL-based baselines at this large data scale.
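
The staged data flow described in the abstract (per-motion experts, then group controllers, then one global policy) can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: policies are scalar linear maps rather than MLPs, an integer skill ID stands in for text conditioning, observations are sampled uniformly instead of coming from simulated rollouts, the RL term of PADL+BC is omitted so that only the supervised (behavior cloning) part remains, and the names `make_expert`, `distill`, `chunk`, `train_progressively`, and the `group_size` value are hypothetical.

```python
import random


def make_expert(gain):
    """Stand-in for an RL-trained expert: a fixed scalar linear policy."""
    return lambda obs: gain * obs


def distill(teachers, num_samples=200, seed=0):
    """Behavior cloning: fit one skill-conditioned student to teacher actions.

    `teachers` maps skill_id -> policy(obs) -> action. The toy student keeps
    one scalar gain per skill, fitted by least squares on (obs, action) pairs.
    """
    rng = random.Random(seed)
    student_gains = {}
    for skill_id, teacher in teachers.items():
        observations = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
        actions = [teacher(o) for o in observations]  # supervised targets
        numerator = sum(o * a for o, a in zip(observations, actions))
        denominator = sum(o * o for o in observations)
        student_gains[skill_id] = numerator / denominator
    return lambda skill_id, obs: student_gains[skill_id] * obs


def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def train_progressively(expert_gains, group_size):
    # Stage 1: one specialized expert per motion (RL training is stubbed out).
    experts = {sid: make_expert(gain) for sid, gain in expert_gains.items()}

    # Stage 2: distill each group of experts into one group controller.
    group_controllers = []
    for members in chunk(sorted(experts), group_size):
        controller = distill({sid: experts[sid] for sid in members})
        group_controllers.append((members, controller))

    # Stage 3: distill all group controllers into a single global policy.
    teachers = {}
    for members, controller in group_controllers:
        for sid in members:
            teachers[sid] = lambda obs, c=controller, s=sid: c(s, obs)
    return distill(teachers, seed=1)


# Ten toy "motions", each with its own expert gain (hypothetical values).
expert_gains = {skill_id: 0.1 * (skill_id + 1) for skill_id in range(10)}
global_policy = train_progressively(expert_gains, group_size=4)
```

In this toy setting the global policy reproduces each expert's action for the requested skill, which is the property the supervised distillation stages aim for. It says nothing about the robustness or transition behavior that the full method targets.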

Paper Structure (24 sections, 8 equations, 7 figures, 4 tables)


Figures (7)

  • Figure 1: An overview of the SuperPADL training process.
  • Figure 2: Network architectures for controllers at each stage of SuperPADL. All controllers are modelled with simple MLPs.
  • Figure 3: The distribution of training times for tracking experts. The majority of policies terminate training early in less than an hour upon attaining a sufficiently low tracking error.
  • Figure 4: Thresholded precision and recall metrics for the SuperPADL global controller as well as PADL and PADL+BC baselines. We observe that the SuperPADL global controller has consistently higher precision and recall.
  • Figure 5: Thresholded precision and recall metrics for our PADL+BC group controller and a PADL baseline. The PADL+BC controllers record stronger scores on both metrics. Standard deviation is calculated across four trained policies, each trained on a distinct motion group.
  • ...and 2 more figures