CCF: Cross Correcting Framework for Pedestrian Trajectory Prediction

Pranav Singh Chib, Pravendra Singh

TL;DR

CCF addresses uncertainty in pedestrian futures by learning robust spatio-temporal representations via a teacher-free cross-correcting framework. It employs two transformer-based subnets that mutually refine each other's predictions through a cross-correction loss and DNet-driven input diversification, augmented with an auxiliary trajectory-classification task. The training objective combines diversity, primary regression, secondary classification, and cross-correction terms as $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{div}} + \mathcal{L}_{\text{subnet},A} + \mathcal{L}_{\text{subnet},B} + \lambda(\mathcal{L}_{\text{cor},A} + \mathcal{L}_{\text{cor},B})$, and evaluation uses a single subnet. Experiments on ETH-UCY and SDD show state-of-the-art or competitive ADE/FDE scores, validating the effectiveness of inter-subnet correction and input diversity for multi-agent trajectory prediction.
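
To make the structure of the objective concrete, the following is a minimal sketch of how the five scalar terms might be combined once each has been computed. The function name `total_loss`, its argument names, and the example values are illustrative assumptions; the summary above does not specify how the individual terms are computed or what value $\lambda$ takes.

```python
def total_loss(l_div, l_subnet_a, l_subnet_b, l_cor_a, l_cor_b, lam):
    """Combine already-computed scalar loss terms.

    Mirrors L_total = L_div + L_subnet,A + L_subnet,B + lambda * (L_cor,A + L_cor,B).
    How each term is computed is outside the scope of this sketch.
    """
    # Only the two cross-correction terms are scaled by lambda.
    return l_div + l_subnet_a + l_subnet_b + lam * (l_cor_a + l_cor_b)


# Placeholder numbers chosen for easy arithmetic; they are NOT results from the paper.
example = total_loss(l_div=0.25, l_subnet_a=1.0, l_subnet_b=1.5,
                     l_cor_a=0.5, l_cor_b=0.25, lam=2.0)
print(example)
```

In this form, $\lambda$ is the single knob that trades off mutual correction against each subnet's own prediction and classification losses.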

Abstract

Accurately predicting future pedestrian trajectories is crucial across various domains. Due to the uncertainty in future pedestrian trajectories, it is important to learn complex spatio-temporal representations in multi-agent scenarios. To address this, we propose a novel Cross-Correction Framework (CCF) to better learn spatio-temporal representations of pedestrian trajectories. Our framework consists of two trajectory prediction models, known as subnets, which share the same architecture and are trained with both a cross-correction loss and a trajectory prediction loss. Cross-correction leverages the learning from both subnets and enables them to refine their underlying representations of trajectories through a mutual correction mechanism. Specifically, the subnets use the cross-correction loss to learn how to correct each other through inter-subnet interaction. To induce diverse learning among the subnets, we use the transformed observed trajectories produced by a neural network as input to one subnet and the original observed trajectories as input to the other subnet. We utilize a transformer-based encoder-decoder architecture for each subnet to capture motion and social interaction among pedestrians. The encoder of the transformer captures motion patterns in trajectories, while the decoder focuses on pedestrian interactions with neighbors. Each subnet performs the primary task of predicting future trajectories (a regression task) along with the secondary task of classifying the predicted trajectories (a classification task). Extensive experiments on real-world benchmark datasets such as ETH-UCY and SDD demonstrate the efficacy of our proposed framework, CCF, in precisely predicting future pedestrian trajectories. We also conducted several ablation experiments to demonstrate the effectiveness of the various modules and loss functions used in our approach.
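
The input diversification described above (original trajectory to one subnet, a transformed trajectory to the other) can be sketched as follows. This is a rough illustration under several assumptions, not the authors' implementation: the Gaussian-noise step is taken from the Figure 1 caption, `dnet_placeholder` is a hypothetical fixed affine map standing in for the learned DNet network, and the one-hot, per-timestep encoding of the trajectory class is assumed because the text only says the trajectory is concatenated with trajectory classes.

```python
import random


def add_gaussian_noise(traj, sigma, rng):
    """Perturb every (x, y) point with zero-mean Gaussian noise."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in traj]


def dnet_placeholder(traj, scale=1.0, shift=(0.0, 0.0)):
    """Hypothetical stand-in for the learned DNet: a fixed affine map."""
    return [(scale * x + shift[0], scale * y + shift[1]) for x, y in traj]


def one_hot(index, num_classes):
    """Assumed class encoding: a one-hot vector of length num_classes."""
    return [1.0 if k == index else 0.0 for k in range(num_classes)]


def concat_with_class(traj, class_vec):
    """Append the class vector to every timestep (assumed concatenation layout)."""
    return [[x, y] + list(class_vec) for x, y in traj]


def build_subnet_inputs(traj, class_index, num_classes, sigma, rng):
    """Return (input for SubNet A, input for SubNet B)."""
    class_vec = one_hot(class_index, num_classes)
    diversified = dnet_placeholder(add_gaussian_noise(traj, sigma, rng))
    input_a = concat_with_class(traj, class_vec)         # original trajectory
    input_b = concat_with_class(diversified, class_vec)  # diversified trajectory
    return input_a, input_b


rng = random.Random(0)  # fixed seed so the sketch is reproducible
observed = [(0.0, 0.0), (1.0, 1.0)]
input_a, input_b = build_subnet_inputs(observed, class_index=1, num_classes=3,
                                       sigma=0.1, rng=rng)
print(input_a)
print(input_b)
```

Both inputs have the same shape, so two subnets with identical architecture can consume them; only the coordinates differ.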

Paper Structure

This paper contains 19 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Illustration of the CCF framework, which consists of two transformer-based trajectory prediction networks (SubNet A and SubNet B). SubNet A receives the original observed trajectory ($X_i$), while SubNet B receives a diversified version (${X_i}^{\prime}$) of $X_i$. To produce ${X_i}^{\prime}$, Gaussian noise is first added to $X_i$, and the result is passed through DNet. The original observed trajectory $X_i$ and the transformed trajectory ${X_i}^{\prime}$ are separately concatenated (symbol C) with trajectory classes and given as inputs to SubNet A and SubNet B, respectively. The cross-correction mechanism is used for inter-subnet interactions. Here, $PCP_{A,i}$ and $PCP_{B,i}$ are the class probabilities predicted by SubNet A and SubNet B, respectively, and $GCP_{i}$ denotes the ground-truth class probabilities. $\mathcal{L}_{\text{traj}}$ is the trajectory prediction loss, and $\mathcal{L}_{\text{cor}}$ is the cross-correction loss.
  • Figure 2: Illustration of the transformer-based subnet. The observed trajectory is concatenated (symbol C) with trajectory classes and given as input to a linear layer. The output of the linear layer is added to the position encoding to obtain the input embedding, which is passed to the encoder of the transformer. The output of the encoder is the predicted class probabilities. Next, the output of the encoder, along with the embedding of the neighbors' observed trajectories, is fed to the decoder to capture the social interaction between the pedestrian and their neighbors. Finally, the decoder outputs the predicted future trajectory. Here, K denotes the number of trajectory classes, D denotes the embedding size, PE refers to the position encoding, and N indicates the number of neighbors. $L_E$ and $L_D$ represent the number of layers in the encoder and decoder, respectively.
  • Figure 3: Illustration of the predicted pedestrian trajectories on the ETH-UCY and SDD datasets. Predicted pedestrian trajectories from our approach are depicted in orange, observed trajectories in blue, and ground truth trajectories in green. Our approach accurately predicts future trajectories.
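
The agreement between predicted and ground-truth trajectories shown in Figure 3 is what the ADE/FDE scores mentioned in the TL;DR quantify. As a minimal sketch, assuming the standard definitions (ADE, Average Displacement Error, is the mean Euclidean distance over all predicted timesteps; FDE, Final Displacement Error, is the Euclidean distance at the last timestep), the two metrics for a single trajectory could be computed as below. The function name `ade_fde` and the toy coordinates are illustrative and do not come from the paper.

```python
import math


def ade_fde(pred, truth):
    """Return (ADE, FDE) for one predicted trajectory against its ground truth.

    Both arguments are equal-length, non-empty sequences of (x, y) points.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-timestep Euclidean distance between prediction and ground truth.
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


# Toy coordinates for illustration only.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(predicted, ground_truth)
print(ade, fde)  # 1.0 2.0
```

Lower values are better for both metrics; FDE is at least as sensitive to drift as ADE because it looks only at the endpoint.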