EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations

Jiayi Liu, Jiaming Zhou, Ke Ye, Kun-Yu Lin, Allan Wang, Junwei Liang

TL;DR

This work tackles robust trajectory forecasting from ego-centric observations by introducing EgoTraj-Bench, the first real-world benchmark that pairs noisy ego-view observation histories with clean, BEV-grounded future trajectories. It proposes BiFlow, a dual-stream flow-matching model that jointly denoises histories and predicts futures over a shared latent representation, together with EgoAnchor, a mechanism that supplies intention priors distilled from the history to stabilize predictions under occlusion, ID switches, and tracking drift. Experiments on EgoTraj-Bench and related datasets show BiFlow achieving state-of-the-art performance and improved robustness, reducing minADE and minFDE by approximately 10–15% on average. Together, the benchmark and method provide a practical foundation for ego-centric trajectory prediction systems that remain robust to the perceptual noise encountered in deployment.

Abstract

Reliable trajectory prediction from an ego-centric perspective is crucial for robotic navigation in human-centric environments. However, existing methods typically assume idealized observation histories, failing to account for the perceptual artifacts inherent in first-person vision, such as occlusions, ID switches, and tracking drift. This discrepancy between training assumptions and deployment reality severely limits model robustness. To bridge this gap, we introduce EgoTraj-Bench, the first real-world benchmark that grounds noisy, first-person visual histories in clean, bird's-eye-view future trajectories, enabling robust learning under realistic perceptual constraints. Building on this benchmark, we propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion by leveraging a shared latent representation. To better model agent intent, BiFlow incorporates our EgoAnchor mechanism, which conditions the prediction decoder on distilled historical features via feature modulation. Extensive experiments show that BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10-15% on average and demonstrating superior robustness. We anticipate that our benchmark and model will provide a critical foundation for developing trajectory forecasting systems truly resilient to the challenges of real-world, ego-centric perception.
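
The abstract reports results in terms of minADE and minFDE. As a reminder of what these metrics measure, the following is a minimal sketch under the common best-of-K convention: each agent gets K candidate futures, ADE averages the per-step Euclidean error, FDE takes the error at the final step, and the minimum over candidates is taken for each metric independently. The function names, the 2D point representation, and the toy numbers are illustrative assumptions, not the paper's evaluation code.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def displacement_errors(pred: Trajectory, gt: Trajectory) -> List[float]:
    """Per-timestep Euclidean distance between one prediction and the ground truth."""
    if len(pred) != len(gt):
        raise ValueError("prediction and ground truth must have the same length")
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def min_ade_fde(candidates: Sequence[Trajectory], gt: Trajectory) -> Tuple[float, float]:
    """Return (minADE, minFDE) over K candidate futures for a single agent.

    Assumption: the two minima are taken independently, so they may come
    from different candidates.
    """
    ades, fdes = [], []
    for cand in candidates:
        errs = displacement_errors(cand, gt)
        ades.append(sum(errs) / len(errs))  # average displacement error
        fdes.append(errs[-1])               # final displacement error
    return min(ades), min(fdes)


# Toy data (hypothetical): the ground truth moves 1 m per step along x.
gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
candidates = [
    [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],  # constant 1 m lateral offset
    [(1.0, 0.0), (2.0, 0.0), (3.0, 1.5)],  # exact until the last step
]
min_ade, min_fde = min_ade_fde(candidates, gt)
print(min_ade, min_fde)  # 0.5 1.0
```

In this toy case the best ADE comes from the second candidate and the best FDE from the first, which is why the two numbers are usually reported side by side.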

Paper Structure

This paper contains 21 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Illustration of key challenges under ego-view observations. Top row: Occlusion. In the first-person view (a), only one pedestrian (green box) is visible due to occlusion; the corresponding BEV (b) shows two additional agents (pink and yellow) who are behind the green pedestrian. Dashed lines indicate trajectories visible in BEV but not in FPV. Bottom row: ID Switch and Perspective Distortion. Two pedestrians (yellow and pink) cross paths, causing an ID swap in the FPV tracking output (a). Additionally, individuals near the image corners suffer from significant perspective distortion, making accurate localization challenging.
  • Figure 2: EgoTraj-Bench Overview. Left: Synchronized BEV and FPV videos are obtained from the dataset; the blue box marks a temporally aligned frame. Middle: Clean past and future trajectories are extracted from BEV annotations as ground truth, while noisy historical observations are projected from FPV videos. Right: The noisy ego-view histories are paired with ground truth, enabling robust evaluation under realistic ego-centric conditions. A mask is also generated based on history visibility.
  • Figure 3: Overview of our BiFlow. The input consists of a noisy historical trajectory $\tilde{X}$ and its corresponding visibility mask $m$. During training, the model is supervised with clean ground-truth past $X$ and future $Y$ trajectories to jointly learn reconstruction and prediction. At inference, only the noisy history and mask are used as input to predict the future trajectory $\hat{Y}$.
  • Figure 4: Qualitative Results. Solid lines represent the ground-truth trajectories, while dashed lines show the predicted trajectories.
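
The captions for Figures 2 and 3 describe the model input as a noisy history $\tilde{X}$ paired with a visibility mask $m$. The sketch below shows one plausible way such a pair could be assembled from per-frame tracker output. It is only an illustration: encoding missing frames as `None`, holding the last visible position for invisible frames, and the function name `build_masked_history` are all assumptions made here, not the procedure used in the paper.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def build_masked_history(
    observations: List[Optional[Point]],
) -> Tuple[List[Point], List[int]]:
    """Turn a history with missing frames into (filled positions, visibility mask).

    Assumptions (hypothetical, for illustration only):
      - a frame where the agent is not visible is given as None;
      - invisible frames are filled by holding the last visible position,
        or (0.0, 0.0) if the agent has not been seen yet;
      - mask entries are 1 for visible frames and 0 otherwise.
    """
    filled: List[Point] = []
    mask: List[int] = []
    last_seen: Optional[Point] = None
    for obs in observations:
        if obs is None:
            filled.append(last_seen if last_seen is not None else (0.0, 0.0))
            mask.append(0)
        else:
            filled.append(obs)
            mask.append(1)
            last_seen = obs
    return filled, mask


# Hypothetical 4-frame history: occluded in frames 0 and 2.
history = [None, (1.0, 2.0), None, (3.0, 4.0)]
x_tilde, m = build_masked_history(history)
print(m)        # [0, 1, 0, 1]
print(x_tilde)  # [(0.0, 0.0), (1.0, 2.0), (1.0, 2.0), (3.0, 4.0)]
```

Keeping the mask separate from the filled positions lets a downstream model distinguish a genuinely stationary agent from a placeholder value inserted for an occluded frame.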