A Mixture of Experts Approach to 3D Human Motion Prediction

Edmund Shieh; Joshua Lee Franco; Kang Min Bae; Tej Lalvani

TL;DR

The paper tackles real-time 3D human motion prediction by evaluating the Spatial-Temporal Transformer and introducing a Soft Mixture-of-Experts (MoE) block within the ST attention to boost inference speed without sacrificing accuracy. It reproduces the state-of-the-art ST Transformer and benchmarks a Soft MoE-ST variant on the AMASS dataset, using axis-angle joint representations and autoregressive prediction over short horizons. Key contributions include a detailed architecture that replaces dense feed-forward networks with a Soft MoE, analysis of hyperparameters, and an empirical demonstration that MoE can scale model capacity with limited inference-time overhead. The findings suggest MoE-based ST Transformers offer a viable path to high-capacity, real-time motion prediction for applications in autonomous systems, robotics, and interactive environments, with publicly available code for replication.
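The autoregressive prediction mentioned above can be sketched as a sliding-window rollout: each predicted pose is appended to the input window and fed back in to predict the next one. This is a minimal illustration, not the authors' code. The function names, the flat per-frame layout (3 axis-angle values per joint), and the zero-velocity stand-in predictor are all assumptions made for the example; in practice a trained model would take the place of `zero_velocity_model`.

```python
import numpy as np

def zero_velocity_model(window):
    """Hypothetical stand-in predictor: repeats the last observed pose."""
    return window[-1]

def autoregressive_rollout(seed, predict_next, horizon):
    """Roll a one-step predictor forward for `horizon` frames.

    seed: (T, J * 3) array of T seed frames, 3 axis-angle values per joint.
    predict_next: callable mapping a (T, J * 3) window to one (J * 3,) pose.
    Returns a (horizon, J * 3) array of predicted poses (horizon >= 1).
    """
    window = np.array(seed, dtype=float)
    preds = []
    for _ in range(horizon):
        nxt = predict_next(window)
        preds.append(nxt)
        # Drop the oldest frame and append the prediction as the newest one.
        window = np.concatenate([window[1:], nxt[None, :]], axis=0)
    return np.stack(preds)
```

Because every prediction is fed back as input, errors can accumulate over the rollout, which is one reason short horizons are the easier evaluation setting.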

Abstract

This project addresses the challenge of human motion prediction, a critical area for applications such as autonomous vehicle movement detection. Previous works have emphasized the need for low inference times to provide real-time performance for applications like these. Our primary objective is to critically evaluate existing model architectures, identifying their advantages and opportunities for improvement, by replicating the state-of-the-art (SOTA) Spatio-Temporal Transformer model as closely as possible given computational constraints. Such models have surpassed the limitations of RNN-based models and have demonstrated the ability to generate plausible motion sequences over both short- and long-term horizons through the use of spatio-temporal representations. We also propose a novel architecture to address the challenges of real-time inference speed by incorporating a Mixture of Experts (MoE) block within the Spatial-Temporal (ST) attention layer. The particular variation that is used is Soft MoE, a fully differentiable sparse Transformer that has shown promising ability to enable larger model capacity at lower inference cost. We make our code publicly available at https://github.com/edshieh/motionprediction
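To make the Soft MoE idea concrete, the following is a minimal NumPy sketch of a single Soft MoE layer as it is commonly formulated: every slot is a softmax-weighted mix of all input tokens (dispatch), each expert processes only its own slots, and every output token is a softmax-weighted mix of all slot outputs (combine). It is not taken from the project repository. The parameter names (`phi`, `w1`, `w2`), the two-layer ReLU experts, and the tensor layout are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(x, phi, w1, w2):
    """Sketch of one Soft MoE layer (assumed shapes, illustrative only).

    x:   (n, d)    input tokens
    phi: (d, e, s) slot parameters for e experts with s slots each
    w1:  (e, d, h) first expert projection
    w2:  (e, h, d) second expert projection
    """
    n, _ = x.shape
    _, e, s = phi.shape
    logits = np.einsum("nd,des->nes", x, phi)
    # Dispatch: normalize over tokens, so each slot is a convex mix of tokens.
    dispatch = softmax(logits, axis=0)
    # Combine: normalize over all slots, so each token mixes all slot outputs.
    combine = softmax(logits.reshape(n, e * s), axis=1).reshape(n, e, s)
    slots = np.einsum("nes,nd->esd", dispatch, x)
    # Each expert applies its own two-layer ReLU MLP to its own slots only.
    hidden = np.maximum(np.einsum("esd,edh->esh", slots, w1), 0.0)
    slot_out = np.einsum("esh,ehd->esd", hidden, w2)
    y = np.einsum("nes,esd->nd", combine, slot_out)
    return y, dispatch, combine
```

In this formulation the per-expert compute depends on the number of slots rather than the number of tokens, which is why adding experts can grow parameter count without a proportional increase in inference cost.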

Paper Structure

This paper contains 25 sections, 9 equations, 14 figures, 7 tables.

Introduction
Approach
Challenges
Experiments and Results
Success Criteria
Experiment 1
Base Model Performance Analysis
Hyper-Parameter tuning for ST Transformer
Experiment 2
ST Transformer vs MoE Inference Time Ablation
ST Transformer vs MoE evaluation
Project Code Repository
Model Specifics
Positional Encoding:
Datatype Precision:
...and 10 more sections

Figures (14)

Figure 1: Temporal Attention. Map of temporal attention weights across timesteps for joints 6, 12, 18, and 24.
Figure 2: Spatial Attention. Map of spatial attention weights across joints at timesteps 30, 60, 90, and 120.
Figure 3: MoE ST Transformer Dispatch Weights. Dispatch weights extracted from the MoE layer in a single forward pass of the trained MoE ST Transformer on an input sequence from the test split. The weights demonstrate a clear routing mechanism for the data to be processed by a specialized expert.
Figure 4: Architecture Overview. Adapted version of the ST Transformer from Figure 2 in Aksan et al. (2020), with the feed-forward layers in the attention layers swapped for an MoE block (highlighted in pink).
Figure 5: Graphical visualization of the inference-time table, demonstrating the scalability of MoE and its ability to handle a large number of parameters.
...and 9 more figures
