Adaptive Human Trajectory Prediction via Latent Corridors

Neerja Thakkar, Karttikeya Mangalam, Andrea Bajcsy, Jitendra Malik

TL;DR

The paper tackles the problem of adapting pre-trained human trajectory predictors to scene-specific, transient behaviors that arise in deployment. It introduces latent corridors, lightweight image-space prompts that are learned per deployment scene and added to the input heatmaps of a frozen base predictor, enabling data-efficient adaptation with minimal parameter overhead ($<0.1\%$). The approach yields substantial gains in ADE across MOTSynth (up to $23.9\%$) and real datasets (MOT/WildTrack up to $16.4\%$, EarthCam up to $26.8\%$), with additional benefits when combined with per-scene finetuning; the method also extends to architectures beyond YNet (e.g., PECNet-Ours achieving $10.2\%$ ADE improvement). Overall, latent corridors enable on-device, continual adaptation to changing scene context and transient events, improving ground-plane awareness and scene-specific pedestrian behaviors in a data-efficient manner.

Abstract

Human trajectory prediction is typically posed as a zero-shot generalization problem: a predictor is learnt on a dataset of human motion in training scenes, and then deployed on unseen test scenes. While this paradigm has yielded tremendous progress, it fundamentally assumes that trends in human behavior within the deployment scene are constant over time. As such, current prediction models are unable to adapt to scene-specific transient human behaviors, such as crowds temporarily gathering to see buskers, pedestrians hurrying through the rain and avoiding puddles, or a protest breaking out. We formalize the problem of scene-specific adaptive trajectory prediction and propose a new adaptation approach inspired by prompt tuning called latent corridors. By augmenting the input of any pre-trained human trajectory predictor with learnable image prompts, the predictor can improve in the deployment scene by inferring trends from extremely small amounts of new data (e.g., 2 humans observed for 30 seconds). With less than 0.1% additional model parameters, we see up to 23.9% ADE improvement in MOTSynth simulated data and 16.4% ADE in MOT and Wildtrack real pedestrian data. Qualitatively, we observe that latent corridors imbue predictors with an awareness of scene geometry and scene-specific human behaviors that non-adaptive predictors struggle to capture. The project website can be found at https://neerja.me/atp_latent_corridors/.
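The core idea, a learnable prompt added to the input of a frozen predictor, can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the "predictor" is a fixed linear map rather than a pre-trained network, the "heatmap" is a flattened two-value vector, and the names `frozen_predictor`, `scene_loss`, and `adapt_prompt` are hypothetical and do not come from the paper's code. Only the prompt is updated by gradient descent; the predictor weights are never touched.

```python
# Toy sketch of prompt-style adaptation. All names are hypothetical and the
# linear "predictor" is a stand-in for a frozen pre-trained network.

def frozen_predictor(x, weights):
    """Fixed linear map; its weights are never updated during adaptation."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def mse(pred, target):
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)

def scene_loss(samples, weights, prompt):
    """Mean loss over a scene's samples with the prompt added to each input."""
    total = 0.0
    for x, y in samples:
        prompted = [xi + pi for xi, pi in zip(x, prompt)]
        total += mse(frozen_predictor(prompted, weights), y)
    return total / len(samples)

def adapt_prompt(samples, weights, steps=200, lr=0.1):
    """Gradient descent on the per-scene prompt only."""
    n_in = len(weights[0])
    prompt = [0.0] * n_in  # a zero prompt reproduces the base predictor
    for _ in range(steps):
        grad = [0.0] * n_in
        for x, y in samples:
            prompted = [xi + pi for xi, pi in zip(x, prompt)]
            resid = [a - b for a, b in zip(frozen_predictor(prompted, weights), y)]
            for j in range(n_in):
                # d(MSE)/d(prompt_j) = (2 / n_out) * sum_i W[i][j] * resid[i]
                grad[j] += 2.0 * sum(row[j] * r for row, r in zip(weights, resid)) / len(resid)
        prompt = [p - lr * g / len(samples) for p, g in zip(prompt, grad)]
    return prompt

# One toy deployment scene: a flattened 2-value input and its target.
weights = [[1.0, 0.0], [0.0, 1.0]]
samples = [([1.0, 0.0], [1.5, 0.5])]
prompt = adapt_prompt(samples, weights)
```

In this toy setting the learned prompt absorbs the systematic offset between what the frozen map predicts and what is observed in the scene. The method described in the abstract applies the same principle with image-shaped prompts and a deep trajectory predictor.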

Paper Structure (28 sections, 3 equations, 9 figures, 6 tables)

This paper contains 28 sections, 3 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Adaptive trajectory prediction. (left) Given a history of human behavior (shown in black), the pre-trained predictor $\mathcal{P}$ is unable to understand deployment scene-specific behavior trends, like people entering a subterranean subway entrance (bottom row) or mostly choosing to traverse the staircase as opposed to exploring other parts of the scene at nighttime (top row). (right) When adapting, the number of people and amount of time determine the total number of trajectories observed, and we denote this time-dependent quantity human-seconds. Here, the three columns correspond to our method trained for a very small (left), medium (middle), and large amount of human-seconds (right). Our adaptive latent corridors approach enables $\mathcal{P}$ to quickly learn context-specific trends, improving predictions with even small amounts of data, and closing the gap between the ground-truth (green) and predicted behavior (orange). For example, in the middle row, $\mathcal{P}$ predicts the person will move towards the camera, but as our method sees more human-seconds of data, it adapts to the trend that at this point of scene capture in the plaza, people tend to avoid the center of the plaza and instead move diagonally across it.
  • Figure 1: Adaptive trajectory prediction. ATP, formulated in Sec. 3, allows a pre-trained predictor $\mathcal{P}$ to adapt to a new deployment scene by learning over time on the deployment scene. Once adaptation has occurred, the adapted predictor $\mathcal{A}[\mathcal{P}]$ should perform better on the deployment scene.
  • Figure 2: Adapting a predictor $\mathcal{P}$ with latent corridors. $\mathcal{P}_E$, $\mathcal{P}_D$ and $\mathcal{P}_P$ are pre-trained on the task of human trajectory prediction, taking as input trajectory heatmaps $\mathbf{M}_{\tau-H:\tau}$ and segmentation $S$, and outputting predicted trajectory heatmaps $\mathbf{M}_{\tau+1:\tau + T}$. We augment $\mathcal{P}$ with a per-scene latent corridor $p$ which is summed element-wise to the input trajectory heatmaps. The latent corridors are trained with $\mathcal{P}_E$ and $\mathcal{P}_D$ frozen. The predictor head $\mathcal{P}_P$ can be frozen, tuned on a single deployment scene, or tuned jointly across multiple scenes.
  • Figure 2: Adaptation over time (FDE). As in main text Fig. 5, the x-axis represents normalized adaptation time in human-seconds. The y-axis represents the FDE (lower is better). Results are normalized per-scene and averaged over models trained on 25 MOTSynth scenes (a) and 7 from MOT and WildTrack (b), with shaded area $\sigma/10$. For the FDE metric, our methods improve on the baselines increasingly with adaptation time. Latent corridors + per-scene finetuning has the best performance, as with ADE, and ATP via just finetuning or just latent corridor learning is still comparable. (c) Comparison to baseline over many MOTSynth scenes for models trained with $8\%$ (point) and $80\%$ (arrowhead) human-second datasets for FDE. For many deployment scenes, FDE improves significantly more with our method than ADE improved, but still, the per-scene improvements are varied.
  • Figure 3: Qualitative results on MOTSynth (top; synthetic) and MOT and WildTrack (bottom; real). These examples show scenarios where our LC + per-scene finetune ATP method (orange) outperforms the scene-aware baseline (purple). In several MOTSynth examples, the baseline predicts the pedestrian floats into the air (top row), while our method has gained awareness of where the 3D ground plane lies in the 2D image. We also note that patterns of behaviour such as walking on the sidewalk instead of into the road (second row left) and walking up the traversable portion of stairs (second row right) are captured. On real data, we observe similar awareness of the ground plane and obstacles, as well as a better understanding of nuanced human behavior patterns such as crossing diagonally across a plaza.
  • ...and 4 more figures
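
The captions above report results in terms of ADE and FDE. As a rough reference, the sketch below computes both for a single trajectory, assuming the definitions commonly used in trajectory prediction: ADE is the mean Euclidean distance between predicted and ground-truth positions over all predicted timesteps, and FDE is that distance at the final timestep. The function names are hypothetical, and `relative_improvement` assumes that percentage gains are reported as a relative reduction in error against a baseline, which the text here does not spell out.

```python
import math

def displacement_errors(pred, gt):
    """Return (ADE, FDE) for one trajectory given as equal-length lists of (x, y)."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    ade = sum(dists) / len(dists)  # average over all predicted timesteps
    fde = dists[-1]                # error at the final timestep only
    return ade, fde

def relative_improvement(baseline_err, adapted_err):
    """Percentage reduction in error relative to the baseline (assumed convention)."""
    return 100.0 * (baseline_err - adapted_err) / baseline_err

# Hypothetical three-step prediction that drifts away from the ground truth.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, gt)
```

Because FDE looks only at the endpoint while ADE averages over the whole horizon, a prediction that drifts steadily, as in this example, has a larger FDE than ADE.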