SSL-Interactions: Pretext Tasks for Interactive Trajectory Prediction

Prarthana Bhattacharyya; Chengjie Huang; Krzysztof Czarnecki

TL;DR

This work tackles interactive trajectory forecasting in multi-agent scenes by introducing SSL-Interactions, a self-supervised framework that decomposes joint dynamics into a scalable marginal predictor plus interaction-focused pretext tasks. Four interaction-aware tasks (range-gap, closest-distance, direction of movement, and type of interaction prediction) are trained alongside the main forecasting task, using pseudo-labeled interacting pairs curated from the data. The approach yields consistent improvements over a state-of-the-art baseline, particularly in interactive scenarios, with gains of up to 8% on proposed metrics such as i-minFDE_6 and CAM_6, while maintaining competitive performance on non-interactive data. The study also contributes a practical data-curation method and new evaluation metrics tailored to interaction-rich scenes, with safety-critical autonomous driving as the target application.

Abstract

This paper addresses motion forecasting in multi-agent environments, which is pivotal for ensuring the safety of autonomous vehicles. Traditional as well as recent data-driven marginal trajectory prediction methods struggle to properly learn non-linear agent-to-agent interactions. We present SSL-Interactions, which proposes pretext tasks to enhance interaction modeling for trajectory prediction. We introduce four interaction-aware pretext tasks to encapsulate various aspects of agent interactions: range gap prediction, closest distance prediction, direction of movement prediction, and type of interaction prediction. We further propose an approach to curate interaction-heavy scenarios from datasets. This curated data has two advantages: it provides a stronger learning signal to the interaction model, and it facilitates the generation of pseudo-labels for interaction-centric pretext tasks. We also propose three new metrics specifically designed to evaluate predictions in interactive scenes. Our empirical evaluations indicate that SSL-Interactions outperforms state-of-the-art motion forecasting methods on interaction-heavy scenarios, both quantitatively (with up to 8% improvement) and qualitatively.
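To make the idea of pseudo-labels for a pretext task concrete, the following is a minimal sketch of how a closest-distance label might be derived for one curated pair of agents. It is an illustration only, not the paper's implementation: the function name `closest_distance_label`, the representation of a trajectory as a list of time-aligned `(x, y)` waypoints, and the choice to return the time index alongside the distance are all assumptions made here.

```python
import math

def closest_distance_label(target_future, other_future):
    """Pseudo-label sketch: minimum distance between two time-aligned futures.

    Both arguments are lists of (x, y) waypoints sampled at the same
    timestamps (an assumption of this sketch). Returns the smallest
    distance and the time index at which it occurs.
    """
    dists = [math.dist(p, q) for p, q in zip(target_future, other_future)]
    idx = min(range(len(dists)), key=dists.__getitem__)
    return dists[idx], idx

# Hypothetical pair: the other agent is nearest at the second timestep.
target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
other = [(0.0, 5.0), (1.0, 3.0), (2.0, 4.0)]
label, step = closest_distance_label(target, other)
```

Because the label is computed from recorded future trajectories, it requires no manual annotation, which is what makes such a task self-supervised.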

Paper Structure (28 sections, 18 equations, 5 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Marginal prediction
Joint prediction
Prediction with Pretext Tasks
Problem Formulation
SSL-Interactions
Labeling Interactions
Proposed Pretext Tasks
Range-gap Prediction
Closest-distance Prediction
Direction of Movement Prediction
Type of Interaction Prediction
Training Scheme
Experiments
...and 13 more sections

Figures (5)

Figure 1: Comparison of approaches for training multi-agent forecasting systems. (a) Interactions between agents are not considered explicitly. Each node represents an agent's state, and arrows denote the aggregation of these marginal distributions $P_0(\boldsymbol{Y})$. (b) Interactions between all pairs of agents are considered. The nodes represent the random variables, and the bidirectional edges between every pair of nodes denote the dependencies among all pairs of random variables. $P(\boldsymbol{Y})$ indicates the estimation of the full joint interaction distribution. (c) SSL-Interactions considers interactions between pseudo-labeled pairs of agents. The nodes represent the random variables, grouped by interaction-specific dependencies into subsets $A_i$ enclosed by dashed-line rectangles. $P(\boldsymbol{T}|A_i)$ represents the conditional distributions for the pretext task, which are used to train the interaction module in a self-supervised setup. $P_0(\boldsymbol{Y})$ captures the marginal distributions of all the agents.
Figure 2: Illustration of the proposed data curation method for explicitly labeling pairwise interactions. The future trajectory of the target agent is shown in yellow, while the past inputs are shown in black. The first step identifies agents within a specified distance threshold, indicated in violet. However, distance thresholding alone is inadequate, as vehicles moving in opposite directions frequently do not interact. In the second step, oncoming agents are filtered out if the target agent intends to proceed straight, but are retained if the target's intended action is a left turn. The future trajectories of the final interacting agents are shown in red.
Figure 3: Schematic diagram of the proposed model incorporating a pretext task as an auxiliary loss. The agent encoder processes each agent's observed trajectory, while the map encoder handles high-definition (HD) map encoding. These representations are initially passed to a context encoder, generating map-conditioned agent features. These are subsequently processed by an agent-to-agent attention-based encoder, which encodes inter-agent dependencies. This comprehensive representation informs both the future trajectory decoder and the proposed pretext task component. The pretext task loss, benefiting from a stop gradient, exclusively trains the agent-to-agent encoder, ensuring only interaction-specific features are harnessed by the pretext tasks. Consequently, any improvements can be specifically attributed to enhanced interaction modeling within the agent-to-agent encoder. The pretext task loss, serving as an auxiliary task, is discarded during the inference phase.
Figure 4: Our proposed interaction-based pretext tasks: range-gap prediction (top-left), closest-distance prediction (top-right), direction of movement prediction (bottom-left), and type of interaction prediction (bottom-right), as described in the Proposed Pretext Tasks section.
Figure 5: Motion forecasting on the curated, interaction-heavy Argoverse validation set. We present four challenging scenarios for analysis. The first column depicts the scene featuring the target agent and the curated interactive agents. The second column contains predictions made by the baseline model LaneGCN. The third column displays the predictions of our proposed model when the connections to the interactive agents are disconnected. The fourth and final column features predictions from our model when regularized with the pretext loss. The baseline model fails to accurately forecast any of the scenarios. The first row illustrates a case where the predicted range-gap accurately anticipates at least one future trajectory that evades collision with the forward vehicle. The second row presents a congested situation in which forecasting the closest-distance with interacting agents leads to a future trajectory devoid of collision with the vehicle in front. The third row depicts a similar situation, but in the context of predicting direction of movement: both the baseline and the model without interacting agent information result in a collision course with the agent ahead, whereas our model proposes a trajectory closely aligned with the ground truth, avoiding collision. The final row presents a turning scenario in which the type of interaction is predicted. Our proposed model forecasts a 'close-follow' situation with the interacting vehicle and identifies a potential future trajectory that is most closely aligned with the ground truth, while the baseline and the model without knowledge of interacting agents fail in terms of the $\text{MR}_6$ metric. Please refer to the Proposed Pretext Tasks section for details.
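The two-step curation described in the Figure 2 caption (distance thresholding, then filtering oncoming agents unless the target turns left) can be sketched roughly as follows. This is a hedged illustration rather than the authors' procedure: every function name, the use of observed past trajectories for the distance check, the heading estimate from trajectory endpoints, and all threshold values (`dist_thresh`, `oncoming_thresh`, `turn_thresh`) are assumptions introduced here.

```python
import math

def heading(traj):
    """Approximate heading (radians) from the first to the last waypoint."""
    (x0, y0), (x1, y1) = traj[0], traj[-1]
    return math.atan2(y1 - y0, x1 - x0)

def signed_angle(a, b):
    """Signed difference a - b wrapped to [-pi, pi)."""
    return (a - b + math.pi) % (2 * math.pi) - math.pi

def min_distance(traj_a, traj_b):
    """Minimum distance between time-aligned waypoints."""
    return min(math.dist(p, q) for p, q in zip(traj_a, traj_b))

def is_left_turn(past, future, turn_thresh):
    """Left turn if the heading rotates counter-clockwise past the threshold."""
    return signed_angle(heading(future[-2:]), heading(past)) > turn_thresh

def curate_interacting_agents(target_past, target_future, others,
                              dist_thresh=10.0,
                              oncoming_thresh=math.radians(150),
                              turn_thresh=math.radians(30)):
    """Return ids of agents pseudo-labeled as interacting with the target.

    `others` maps agent id -> past trajectory. Threshold defaults are
    illustrative placeholders, not values from the paper.
    """
    target_heading = heading(target_past)
    keep_oncoming = is_left_turn(target_past, target_future, turn_thresh)
    selected = []
    for agent_id, past in others.items():
        # Step 1: keep only agents within the distance threshold.
        if min_distance(target_past, past) > dist_thresh:
            continue
        # Step 2: drop oncoming agents unless the target turns left.
        diff = abs(signed_angle(heading(past), target_heading))
        if diff > oncoming_thresh and not keep_oncoming:
            continue
        selected.append(agent_id)
    return selected

# Hypothetical scene: target drives along +x.
past = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
others = {
    "lead": [(5.0, 0.0), (6.0, 0.0), (7.0, 0.0)],
    "oncoming": [(8.0, 3.0), (7.0, 3.0), (6.0, 3.0)],
    "far": [(100.0, 100.0), (101.0, 100.0), (102.0, 100.0)],
}
straight = curate_interacting_agents(past, [(3.0, 0.0), (4.0, 0.0), (5.0, 0.0)], others)
left = curate_interacting_agents(past, [(3.0, 0.0), (4.0, 1.0), (4.0, 2.0)], others)
```

In this toy scene the oncoming vehicle is discarded when the target continues straight and kept when the target turns left across its path, mirroring the rule stated in the caption.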
