Bayesian-Optimized One-Step Diffusion Model with Knowledge Distillation for Real-Time 3D Human Motion Prediction

Sibo Tian, Minghui Zheng, Xiao Liang

TL;DR

This work tackles real-time 3D human motion prediction by combining diffusion-based sampling with a two-stage knowledge-distillation strategy. It first distills a state-of-the-art diffusion predictor (TransFusion) into a one-step diffusion model, then distills that model into a compact MLP-based denoiser, with Bayesian optimization tuning the training hyperparameters. The resulting model, SwiftDiff, achieves real-time inference with negligible loss in prediction quality and runs substantially faster than prior diffusion methods. Fast, multimodal motion predictions of this kind support safer and more responsive human-robot collaboration.

Abstract

Human motion prediction is a cornerstone of human-robot collaboration (HRC), as robots need to infer the future movements of human workers based on past motion cues to proactively plan their motion, ensuring safety in close collaboration scenarios. The diffusion model has demonstrated remarkable performance in predicting high-quality motion samples with reasonable diversity, but suffers from a slow generative process which necessitates multiple model evaluations, hindering real-world applications. To enable real-time prediction, in this work, we propose training a one-step multi-layer perceptron-based (MLP-based) diffusion model for motion prediction using knowledge distillation and Bayesian optimization. Our method contains two steps. First, we distill a pretrained diffusion-based motion predictor, TransFusion, directly into a one-step diffusion model with the same denoiser architecture. Then, to further reduce the inference time, we remove the computationally expensive components from the original denoiser and use knowledge distillation once again to distill the obtained one-step diffusion model into an even smaller model based solely on MLPs. Bayesian optimization is used to tune the hyperparameters for training the smaller diffusion model. Extensive experimental studies are conducted on benchmark datasets, and our model can significantly improve the inference speed, achieving real-time prediction without noticeable degradation in performance.
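The distillation idea described in the abstract can be illustrated with a deliberately tiny example. The sketch below is not the authors' implementation: the scalar "motion", the toy teacher update rule, the linear student, and all names (`teacher_denoiser`, `teacher_sample`, `student_sample`, `distill`) are assumptions made for illustration only. It shows the general pattern of querying a frozen multi-step sampler and fitting a one-step student to its outputs with a mean squared error loss.

```python
import random

T_STEPS = 3  # number of teacher sampling steps (toy value, an assumption)


def teacher_denoiser(x, c):
    """Toy noise predictor: treats 2*c as the clean sample for condition c."""
    return x - 2.0 * c


def teacher_sample(z, c, steps=T_STEPS):
    """Multi-step sampling: one denoiser evaluation per step."""
    x = z
    for _ in range(steps):
        x = x - 0.5 * teacher_denoiser(x, c)
    return x


def student_sample(params, z, c):
    """One-step student: a single linear map of noise z and condition c."""
    a, b, d = params
    return a * z + b * c + d


def distill(num_iters=5000, lr=0.02, seed=0):
    """Fit the student to the frozen teacher with SGD on the squared error."""
    rng = random.Random(seed)
    params = [0.0, 0.0, 0.0]
    for _ in range(num_iters):
        z = rng.gauss(0.0, 1.0)        # input noise
        c = rng.uniform(-1.0, 1.0)     # stand-in for the observed history
        target = teacher_sample(z, c)  # teacher is only queried, never updated
        err = student_sample(params, z, c) - target
        grads = (2.0 * err * z, 2.0 * err * c, 2.0 * err)  # d(err**2)/d(a, b, d)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


if __name__ == "__main__":
    student = distill()
    print("student parameters:", student)
    print("teacher:", teacher_sample(0.7, -0.3))
    print("student:", student_sample(student, 0.7, -0.3))
```

In this toy setting the three teacher steps compose to the linear map 0.125·z + 1.75·c, so the student can match the teacher exactly with one evaluation. A real denoiser is nonlinear, and the match would only be approximate.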

Paper Structure (14 sections, 14 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: SwiftDiff is a fast one-step MLP-based diffusion model for real-time 3D human motion prediction, derived from TransFusion [tian2024transfusion]. It generates predictions of similar quality much faster than existing diffusion-based methods.
  • Figure 2: Architecture of the noise prediction network. Panels (a), (b), and (c) show the detailed structure of TransFusion, which is used as the teacher model in this work. Panels (d) and (e) show the structure of our proposed one-step MLP-based diffusion model for real-time human motion prediction.
  • Figure 3: Overview of knowledge distillation with mean squared error loss. The parameters of the teacher model are frozen during distillation; only the parameters of the student model are updated.
  • Figure 4: The progress of Bayesian optimization for both cases on both datasets.
  • Figure 5: Visualization of predictions. Three predicted samples are displayed for each model. Variations are exhibited in the walking motion, and all predictions are semantically consistent with the historical motion. No detectable degradation is observed in the prediction results of the distilled models. More animations can be found at https://github.com/sibotian96/SwiftDiff.
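
The Bayesian optimization referred to in Figure 4 can be pictured with a generic loop. This is a minimal sketch under stated assumptions, not the paper's setup: it tunes a single hypothetical hyperparameter (the base-10 log of a learning rate), replaces "train the student and measure validation error" with a made-up quadratic `validation_loss`, and uses a Gaussian-process surrogate with an RBF kernel and the expected-improvement acquisition function. All function names are illustrative.

```python
import math


def validation_loss(log_lr):
    """Hypothetical stand-in for training the student and scoring it."""
    return (log_lr + 3.0) ** 2


def rbf(a, b, length=1.0):
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def expected_improvement(mu, sigma, best):
    """Expected improvement for minimization."""
    imp = best - mu
    z = imp / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return imp * cdf + sigma * pdf


def bayes_opt(objective, init_xs, candidates, n_iters=10, jitter=1e-4):
    xs = list(init_xs)
    ys = [objective(x) for x in xs]
    for _ in range(n_iters):
        mean_y = sum(ys) / len(ys)
        K = [[rbf(a, b) + (jitter if i == j else 0.0)
              for j, b in enumerate(xs)] for i, a in enumerate(xs)]
        L = cholesky(K)
        alpha = chol_solve(L, [y - mean_y for y in ys])
        best = min(ys)

        def score(x):
            k = [rbf(a, x) for a in xs]
            mu = mean_y + sum(ki * ai for ki, ai in zip(k, alpha))
            var = 1.0 - sum(ki * vi for ki, vi in zip(k, chol_solve(L, k)))
            return expected_improvement(mu, math.sqrt(max(var, 1e-12)), best)

        x_next = max(candidates, key=score)  # most promising candidate
        xs.append(x_next)
        ys.append(objective(x_next))
    i_best = min(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best], xs, ys


if __name__ == "__main__":
    grid = [-5.0 + 0.05 * i for i in range(81)]  # log10(lr) in [-5, -1]
    best_x, best_y, _, _ = bayes_opt(validation_loss, [-5.0, -2.0, -1.0], grid)
    print("best log10(lr):", best_x, "loss:", best_y)
```

Each iteration refits the surrogate to all evaluations so far and spends the next (expensive) training run where the expected improvement is highest. In practice a maintained library would normally replace the hand-written Gaussian process used here.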