Disentangled Neural Relational Inference for Interpretable Motion Prediction
Victoria M. Dax, Jiachen Li, Enna Sachdeva, Nakul Agarwal, Mykel J. Kochenderfer
TL;DR
The paper addresses interpretable and robust multi-agent motion prediction under distribution shift. It introduces dG-VAE, a variational auto-encoder that learns dynamic interaction graphs with edge features and uses a disentangled latent space to separate time-invariant factors from temporal dynamics. Key contributions are augmenting latent graphs with meaningful edge attributes, applying supervised and unsupervised disentanglement techniques, and reporting better performance than strong baselines on four datasets (NBA, Spring, Motion Capture, inD). The approach improves both predictive accuracy and interpretability, which supports safer and more reliable autonomous systems in complex, interactive environments.
Abstract
Effective interaction modeling and behavior prediction of dynamic agents play a significant role in interactive motion planning for autonomous robots. Although existing methods have improved prediction accuracy, few research efforts have been devoted to enhancing prediction model interpretability and out-of-distribution (OOD) generalizability. This work addresses these two challenging aspects by designing a variational auto-encoder framework that integrates graph-based representations and time-sequence models to efficiently capture spatio-temporal relations between interactive agents and predict their dynamics. Our model infers dynamic interaction graphs in a latent space augmented with interpretable edge features that characterize the interactions. Moreover, we aim to enhance model interpretability and performance in OOD scenarios by disentangling the latent space of edge features, thereby strengthening model versatility and robustness. We validate our approach through extensive experiments on both simulated and real-world datasets. The results show superior performance compared to existing methods in modeling spatio-temporal relations, motion prediction, and identifying time-invariant latent features.
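To make the idea of a variational auto-encoder with a disentangled latent edge space more concrete, a generic training objective of this kind can be sketched as an evidence lower bound. This is an illustrative form under stated assumptions, not the objective given by the authors: the symbols $x_{1:T}$ (observed agent trajectories), $z^{s}$ (time-invariant edge latents), $z^{d}_{t}$ (time-varying edge latents), the factorization of the posterior, and the weight $\beta$ are all notation introduced here for illustration.

```latex
% Assumed notation (not from the source):
%   x_{1:T}  : observed trajectories of all agents
%   z^{s}    : time-invariant latent edge features
%   z^{d}_t  : time-varying latent edge features at step t
%   \beta    : weight on the KL terms, as in beta-VAE-style objectives
\begin{aligned}
\mathcal{L}(\theta, \phi)
  ={}& \mathbb{E}_{q_\phi(z^{s},\, z^{d}_{1:T} \mid x_{1:T})}
       \Big[ \sum_{t=1}^{T-1} \log p_\theta\big(x_{t+1} \mid x_{1:t},\, z^{s},\, z^{d}_{1:t}\big) \Big] \\
     & - \beta \, \mathrm{KL}\big( q_\phi(z^{s} \mid x_{1:T}) \,\|\, p(z^{s}) \big) \\
     & - \beta \sum_{t=1}^{T} \mathbb{E}_{q_\phi(z^{d}_{<t} \mid x_{1:T})}
       \Big[ \mathrm{KL}\big( q_\phi(z^{d}_{t} \mid x_{1:T},\, z^{d}_{<t}) \,\|\, p_\theta(z^{d}_{t} \mid x_{1:t},\, z^{d}_{<t}) \big) \Big]
\end{aligned}
```

Read this way, the first term rewards accurate next-step prediction from the inferred interaction graph, while the two KL terms regularize the time-invariant and time-varying parts of the latent space separately, which is one common way to encourage the separation the abstract describes.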
