Disentangled Neural Relational Inference for Interpretable Motion Prediction

Victoria M. Dax, Jiachen Li, Enna Sachdeva, Nakul Agarwal, Mykel J. Kochenderfer

TL;DR

The paper tackles the problem of interpretable and robust multi-agent motion prediction under distribution shifts. It introduces dG-VAE, a variational auto-encoder that learns dynamic interaction graphs with edge features and employs disentangled latent spaces to separate time-invariant factors from temporal dynamics. Key contributions include augmenting latent graphs with meaningful edge attributes, applying supervised and unsupervised disentanglement techniques, and demonstrating superior performance across diverse datasets (NBA, Spring, Motion Capture, inD) compared with strong baselines. The approach enhances both predictive accuracy and interpretability, supporting safer and more reliable autonomous systems in complex, interactive environments.

Abstract

Effective interaction modeling and behavior prediction of dynamic agents play a significant role in interactive motion planning for autonomous robots. Although existing methods have improved prediction accuracy, few research efforts have been devoted to enhancing prediction model interpretability and out-of-distribution (OOD) generalizability. This work addresses these two challenging aspects by designing a variational auto-encoder framework that integrates graph-based representations and time-sequence models to efficiently capture spatio-temporal relations between interactive agents and predict their dynamics. Our model infers dynamic interaction graphs in a latent space augmented with interpretable edge features that characterize the interactions. Moreover, we aim to enhance model interpretability and performance in OOD scenarios by disentangling the latent space of edge features, thereby strengthening model versatility and robustness. We validate our approach through extensive experiments on both simulated and real-world datasets. The results show superior performance compared to existing methods in modeling spatio-temporal relations, motion prediction, and identifying time-invariant latent features.
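To make the idea of a time-invariant section of the edge features concrete, here is a minimal sketch in plain Python. It is not the paper's method: the paper's figure captions mention restricted labeling and pair matching, whereas this example swaps in a simpler variance-over-time penalty purely for illustration. The function names `split_edge_features` and `invariance_penalty`, the parameter `n_invariant`, and the list-of-lists layout are all hypothetical.

```python
# Illustrative sketch only: names, shapes, and the penalty are assumptions,
# not the implementation described in the paper.
from statistics import fmean


def split_edge_features(z, n_invariant):
    """Split one edge-feature vector into (designated invariant section, remainder)."""
    return z[:n_invariant], z[n_invariant:]


def invariance_penalty(edge_sequence, n_invariant):
    """Mean squared deviation of the designated section from its time average.

    edge_sequence: list over time steps of edge-feature vectors (lists of
    floats) for a single edge. A result of 0.0 means the designated section
    does not change over time.
    """
    if n_invariant < 1:
        raise ValueError("n_invariant must be at least 1")
    # Keep only the section that is supposed to be time-invariant.
    sections = [split_edge_features(z, n_invariant)[0] for z in edge_sequence]
    # Per-dimension average over time.
    means = [fmean(column) for column in zip(*sections)]
    # Average squared deviation from that time average.
    return fmean(
        (value - mean) ** 2
        for section in sections
        for value, mean in zip(section, means)
    )


# The first two dimensions are constant over time; the third one varies.
steady = [[1.0, 2.0, 0.1], [1.0, 2.0, 0.5], [1.0, 2.0, -0.3]]
steady_penalty = invariance_penalty(steady, n_invariant=2)

# Here the designated dimension drifts from 0.0 to 2.0.
drifting = [[0.0, 0.0], [2.0, 0.0]]
drifting_penalty = invariance_penalty(drifting, n_invariant=1)
```

In this toy setting the penalty is zero for the steady sequence and positive for the drifting one. A training objective could add such a term to push the designated section toward time-invariant behavior, leaving the remaining dimensions free to follow the temporal dynamics.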

Paper Structure (10 sections, 8 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: The encoder evaluates edge features, a section of which is used to increase interpretability through disentanglement, such as restricted labeling or pair matching.
  • Figure 2: Encoder architecture that learns the prior.
  • Figure 3: Variations of disentanglement.
  • Figure 4: (NBA) Trajectory samples predicted by different models. The grey lines represent ground truth trajectories and the blue and green lines show the predictions for home and visiting teams. The purple lines represent the basketball.
  • Figure 5: ($k$-Spring) Trajectory samples predicted by different models. Each color represents a different point mass and the grey lines represent the ground truth trajectories.
  • ...and 5 more figures