Context-based Interpretable Spatio-Temporal Graph Convolutional Network for Human Motion Forecasting

Edgar Medina; Leyong Loh; Namrata Gurung; Kyung Hun Oh; Niels Heller

TL;DR

This work addresses the challenge of predicting future 3D human poses while also offering interpretable insights into the learned spatio-temporal relationships. The authors introduce CIST-GCN, a context-based, interpretable spatio-temporal graph convolutional network that learns sample-specific adjacency and feature-importance representations through components such as DST-GCN, DAE, GaNet, and ConNet, with an Atrous Pyramid TCN decoder. Across Human3.6M, AMASS, 3DPW, and ExPI, the model achieves competitive or state-of-the-art MPJPE performance and demonstrates robustness to out-of-distribution perturbations, while providing explicit interpretability via feature importance vectors and maps. This combination of accuracy, robustness, and built-in explanations enhances practical applicability in real-world motion understanding and analysis.

Abstract

Human motion prediction remains an open problem that is highly important for autonomous driving and safety applications. Because of the complex spatio-temporal relations in motion sequences, it is challenging not only to predict movement but also to give a preliminary interpretation of the joint connections. In this work, we present a Context-based Interpretable Spatio-Temporal Graph Convolutional Network (CIST-GCN), an efficient GCN-based model for 3D human pose forecasting. It includes dedicated layers that aid model interpretability and provide information that might be useful when analyzing motion distribution and body behavior. Our architecture extracts meaningful information from pose sequences, adds displacements and accelerations to the model input, and predicts the output displacements. Extensive experiments on the Human3.6M, AMASS, 3DPW, and ExPI datasets demonstrate that CIST-GCN outperforms previous methods in human motion prediction and robustness. Because enhancing interpretability for motion prediction has its merits, we present experiments in that direction and provide preliminary evaluations of the resulting insights. Code is available at https://github.com/QualityMinds/cistgcn
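The abstract's description of the input (poses plus displacements and accelerations) and of the output (predicted displacements) can be illustrated with a minimal NumPy sketch. It is an assumption-laden illustration, not the authors' implementation: the function names `motion_features` and `integrate_displacements` are hypothetical, displacements and accelerations are taken as first and second frame-to-frame finite differences, and features are concatenated along the last axis. The paper may define or combine these quantities differently.

```python
import numpy as np


def motion_features(poses: np.ndarray) -> np.ndarray:
    """Stack positions, displacements and accelerations (hypothetical helper).

    poses: array of shape (T, J, 3) with T frames and J joints.
    Returns an array of shape (T, J, 9).
    Assumption: displacement = first finite difference between consecutive
    frames, acceleration = second finite difference; the first frame is
    padded so both start at zero.
    """
    disp = np.diff(poses, axis=0, prepend=poses[:1])
    acc = np.diff(disp, axis=0, prepend=disp[:1])
    return np.concatenate([poses, disp, acc], axis=-1)


def integrate_displacements(last_pose: np.ndarray, pred_disp: np.ndarray) -> np.ndarray:
    """Turn predicted per-frame displacements back into absolute poses.

    last_pose: (J, 3) last observed pose.
    pred_disp: (T_out, J, 3) displacements, each relative to the previous frame
               (an assumption of this sketch).
    Returns (T_out, J, 3) absolute future poses.
    """
    return last_pose[None] + np.cumsum(pred_disp, axis=0)
```

In this sketch a forecasting model would consume `motion_features(observed)` and emit `pred_disp`; `integrate_displacements(observed[-1], pred_disp)` then recovers the predicted pose sequence.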

Paper Structure (17 sections, 4 equations, 6 figures, 4 tables)

Introduction
Related work
Motion Prediction
Model Interpretability
Methodology
Problem Formalization
Review of GCN
Model architecture
Experimental Evaluation
Datasets
Experimental results
Computational Complexity
Implementation details
Discussion
Feature importance vectors
...and 2 more sections

Figures (6)

Figure 1: Illustration of our method. (a) Overview of the proposed CIST-GCN. $X$ and $\hat{X}$ are the input and output, respectively. (b) The basic block of DST-GCN, (c) the Atrous Pyramid TCN, and (d) the Context Network. In more detail, (e) the gating network weights the outputs of DSGN and DTGN, and (f) the Dynamic Adjacency Encoder (DAE) computes the adjacency matrices.
Figure 2: Motion prediction results on the "walking" (top) and "eating" (bottom) motion classes from the H3.6M dataset. Samples with the lowest errors are shown on the left and those with the largest errors on the right. Solid lines are ground truth. Dashed lines are predictions from the $M32$ model. Blue poses are ground truth; red poses are predictions.
Figure 3: t-SNE representation of the test set using (a) input poses and (b) all feature importances from the model, concatenated. MPJPE values are represented by scatter size.
Figure 4: Effect of augmentation on the test set, evaluated with our pipeline. Average MPJPE over the 25 output frames, with (a) rotations between 0 and 360 degrees, and (b) noise rates between 0.0 and 0.2.
Figure 5: Normalized (0-1) and per-layer average adjacency matrices extracted from the CIST-GCN architecture in the spatial (left) and temporal (right) domains for (a) walking, and (b) other motion actions. The right parts display changes in the angles of movement.
...and 1 more figure
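The robustness study summarized in the Figure 4 caption (MPJPE under input rotations and noise) can be sketched as follows. This is a hedged illustration only: the helper names are hypothetical, the rotation axis (here the z axis, assumed vertical) and the meaning of "noise rate" (here the standard deviation of additive Gaussian noise) are assumptions, and the page does not specify how the authors applied these perturbations. MPJPE is computed in its usual form, the mean Euclidean distance between predicted and ground-truth joints.

```python
import numpy as np


def mpjpe(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean per-joint position error: average Euclidean distance over all
    frames and joints. Inputs have shape (..., J, 3)."""
    return float(np.linalg.norm(pred - target, axis=-1).mean())


def rotate_about_vertical(poses: np.ndarray, degrees: float) -> np.ndarray:
    """Rotate every joint about the z axis (assumed to be the vertical axis)."""
    theta = np.deg2rad(degrees)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s, c, 0.0],
                    [0.0, 0.0, 1.0]])
    # Each joint is a row vector v, so v @ rot.T equals (rot @ v) transposed.
    return poses @ rot.T


def add_gaussian_noise(poses: np.ndarray, rate: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation `rate`
    (one possible reading of "noise rate"; an assumption of this sketch)."""
    return poses + rate * rng.standard_normal(poses.shape)
```

With a trained forecaster `model` (not shown here), one would perturb the observed poses, run the model, and report `mpjpe(prediction, ground_truth)` averaged over the output frames for each rotation angle or noise rate.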
