UnityGraph: Unified Learning of Spatio-temporal features for Multi-person Motion Prediction

Kehua Qu, Rui Ding, Jin Tang

TL;DR

A novel graph structure, UnityGraph, is proposed, which treats spatio-temporal features as a whole, enhancing model coherence and coupling, and reformulates multi-person motion prediction into a problem on a single graph.

Abstract

Multi-person motion prediction is a complex and emerging field with significant real-world applications. Current state-of-the-art methods typically adopt dual-path networks to model spatial and temporal features separately. However, the uncertain compatibility of the two networks makes fusing spatio-temporal features challenging and violates the inherent spatio-temporal coherence and coupling of human motion. To address this issue, we propose a novel graph structure, UnityGraph, which treats spatio-temporal features as a whole, enhancing model coherence and coupling. Specifically, UnityGraph is a hypervariate-graph-based network. The flexibility of the hypergraph allows us to treat the observed motions as graph nodes. We then use hyperedges to bridge these nodes and explore spatio-temporal features. This perspective considers spatio-temporal dynamics jointly and reformulates multi-person motion prediction as a problem on a single graph. Using dynamic message passing on this hypergraph, our model learns from both types of relations to generate targeted messages that reflect the relevance among nodes. Extensive experiments on several datasets demonstrate that our method achieves state-of-the-art performance, confirming its effectiveness and innovative design.

Paper Structure

This paper contains 39 sections, 33 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Comparison of our method with single-person prediction methods (zhong2022spatio, dang2021msr), human trajectory prediction methods (xu2022groupnet, li2020evolvegraph), and traditional multi-person motion prediction methods (wang2021multi, xu2023joint, peng2023trajectory). (a) Single-person prediction methods focus on modeling joint relations and neglect interactions within the group. (b) Human trajectory prediction methods lack a representation of 3D pose. (c) Traditional multi-person motion prediction methods employ multiple sub-networks to capture spatial and temporal features separately, which inevitably weakens spatio-temporal coupling and consistency. (d) Our method unifies the learning of spatio-temporal features within a single network for multi-person motion prediction. For clarity, edges that connect nodes across different frames are omitted.
  • Figure 2: Illustration of our motivation. In the past scene, person 2 is walking together with person 3, while person 1 is walking towards person 2. Most current methods consider only the interaction during this phase. (The red dashed lines denote the interactions between individuals.) In future scenes 1 and 2, a sudden event occurs: person 1 meets person 2 and stops to talk. If the existing interaction is not carried forward into the future, person 3 keeps walking and forgets about the person he was walking with, as shown in future scene 1. In contrast, if interaction in the future is considered, the result is different: person 3 also stops and waits for his partner, person 2, as shown in future scene 2. Our method is dedicated to making predictions that comply with scene 2.
  • Figure 3: The framework of our network. (i) The motion representation of each individual in each observed frame is treated as a node of the graph, and the various hyperedges denote relations in the temporal or spatial dimension. Message passing propagates spatio-temporal features through these edges. (ii) The interactive decoding incorporates historical features, such as the last frame of observed motion and the motion sequence after message passing, along with relations that are updated through reasoning at each frame.
  • Figure 4: Illustration of node and hyperedge initialization. We regard the observed poses of $N$ persons as nodes of the graph and define different hyperedges to explore the relations between nodes and capture spatio-temporal features. (i) For each individual, short-term hyperedges connect the nodes of two adjacent frames. (ii) The long-term hyperedges consist of all nodes over the time length $T$. (iii) The spatial hyperedges connect all nodes in the same frame. For clarity, some nodes and hyperedges are omitted in this figure.
  • Figure 5: Illustration of the short-term and long-term hyperedge updates. (a) For a short-term hyperedge, we select node $g_{t,l}^{n}$ and its neighbor $g_{t+1,l}^{n}$ to update the edge $(e_{1})_{t,l}^{n}$ that connects them. (b) We update long-term hyperedges by aggregating all nodes from $g_{t=1,l}^{n}$ to $g_{t=T,l}^{n}$.
  • ...and 7 more figures
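
The three hyperedge types described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names `build_hyperedges` and `incidence_matrix` are hypothetical, nodes are reduced to `(person, frame)` index pairs with no pose features, the layer index $l$ is ignored, and the long-term hyperedge is assumed to be built per individual (as the Figure 5 caption suggests).

```python
from itertools import product


def build_hyperedges(num_persons, num_frames):
    """Return (nodes, hyperedges) for the three hyperedge types in the Figure 4 caption.

    A node is a (person, frame) pair; a hyperedge is (kind, frozenset of nodes).
    Illustrative only: names and data layout are assumptions, not the paper's code.
    """
    nodes = list(product(range(num_persons), range(num_frames)))
    hyperedges = []

    # (i) Short-term: one person's nodes in two adjacent frames.
    for n in range(num_persons):
        for t in range(num_frames - 1):
            hyperedges.append(("short", frozenset({(n, t), (n, t + 1)})))

    # (ii) Long-term: all T nodes of one person (assumed to be per individual).
    for n in range(num_persons):
        hyperedges.append(("long", frozenset((n, t) for t in range(num_frames))))

    # (iii) Spatial: the nodes of all persons within a single frame.
    for t in range(num_frames):
        hyperedges.append(("spatial", frozenset((n, t) for n in range(num_persons))))

    return nodes, hyperedges


def incidence_matrix(nodes, hyperedges):
    """H[i][j] = 1 if node i belongs to hyperedge j, else 0."""
    return [[1 if node in members else 0 for _, members in hyperedges]
            for node in nodes]


nodes, hyperedges = build_hyperedges(num_persons=3, num_frames=4)
H = incidence_matrix(nodes, hyperedges)
print(len(nodes), len(hyperedges))  # 12 nodes; 9 short + 3 long + 4 spatial = 16 hyperedges
```

With $N$ persons and $T$ frames this layout gives $N(T-1)$ short-term, $N$ long-term, and $T$ spatial hyperedges over $NT$ nodes, so every node sits in at least one hyperedge of each type and the whole scene is a single graph.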