Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive?

Yihong Xu, Loïck Chambon, Éloi Zablocki, Mickaël Chen, Alexandre Alahi, Matthieu Cord, Patrick Pérez

TL;DR

This work tackles the discrepancy between motion forecasting models trained on curated data and real-world deployment, where perception modules provide imperfect inputs. It introduces a unified evaluation benchmark that integrates perception outputs with forecasting, enabling fair comparisons between conventional and end-to-end approaches. Across extensive experiments, end-to-end methods do not outperform conventional pipelines under the same perception inputs. A large performance gap emerges when moving from curated maps and tracks to real perception outputs, driven largely by the nature of perception errors (such as localization and detection errors) rather than by differences in precision alone. The findings highlight the need for better map integration, robust handling of perception errors, and distance-aware evaluation to achieve robust real-world motion forecasting. The work also provides an open-source benchmarking tool for future work.

Abstract

Motion forecasting is crucial in enabling autonomous vehicles to anticipate the future trajectories of surrounding agents. To do so, it requires solving mapping, detection, tracking, and then forecasting problems in a multi-step pipeline. In this complex system, advances in conventional forecasting methods have been made using curated data, i.e., with the assumption of perfect maps, detection, and tracking. This paradigm, however, ignores any errors from upstream modules. Meanwhile, an emerging end-to-end paradigm, which tightly integrates the perception and forecasting architectures into joint training, promises to solve this issue. However, the evaluation protocols of the two paradigms have so far been incompatible, making their comparison impossible. In fact, conventional forecasting methods are usually neither trained nor tested in real-world pipelines (e.g., with upstream detection, tracking, and mapping modules). In this work, we aim to bring forecasting models closer to real-world deployment. First, we propose a unified evaluation pipeline for forecasting methods with real-world perception inputs, allowing us to compare conventional and end-to-end methods for the first time. Second, our in-depth study uncovers a substantial performance gap when transitioning from curated to perception-based data. In particular, we show that this gap (1) stems not only from differences in precision but also from the nature of the imperfect inputs provided by perception modules, and (2) is not trivially reduced by simply finetuning on perception outputs. Based on extensive experiments, we provide recommendations for critical areas that require improvement and guidance towards more robust motion forecasting in the real world. The evaluation library for benchmarking models under standardized and practical conditions is provided: https://github.com/valeoai/MFEval.

Paper Structure

This paper contains 15 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Issues of deploying forecasting models to the real world. Using a nuScenes example [nuscenes], we show the forecasts (in orange) inferred by a motion forecasting model [yuan2021agentformer], compared with ground-truth annotations (in green), the ego car location (in red), and the static vehicles (in gray), on predicted or curated maps. (a) Satisfactory forecasting performance in a curated setting; (b) when past trajectories are inferred from tracking models [zhu2019class, Weng2020_AB3DMOT], an agent is not detected and the forecasting model yields poor predictions; (c) when the map is inferred online [harley2022simple], the forecasting model does not anticipate the future turn of one agent.
  • Figure 2: Study overview. We study the challenges of deploying motion forecasting models into the real world when only predicted perception inputs are available. We compare (\ref{sec.eval}): (1) (top) 'conventional methods' [yuan2021agentformer, kim2021lapred] (i.e., methods trained on curated input data), where (middle) we directly replace the curated inputs with real-world data, and (2) (bottom) 'end-to-end methods' [gu2023vip3d, uniad] that are trained and used with perception modules. In the real-world setting, evaluation is challenging because the past tracks are estimated with arbitrary identities, making it difficult to establish a direct correspondence to ground-truth (GT) identities. Therefore, we propose a matching process (purple) to assign predictions to GT and thus evaluate forecasting performance (\ref{sec.eval}). Moreover, we study in depth the impact on motion forecasting of moving from curated data (green) to real-world inputs (orange), covering mapping errors (\ref{sec:map}) and detection and tracking errors (\ref{sec:det_track}).
  • Figure 3: Impact of controlled input errors. Forecasting performance (mAP$_f$) under different proportions of detection and tracking errors ($x$-axis). We simulate misdetections (FN, in blue), false detections (FP@5meters, in green), localization errors (Loc. Error@2meters, in orange), and tracking errors (IDS@5meters, in pink) in the past trajectories.
  • Figure 4: Impact of agent-ego distance. Tracking and forecasting performance w.r.t. agent-ego distance ($x$-axis, in meters) for tracking methods: (camera-based) MUTR3D-R50, MUTR3D-R101, ViP3D, UniAD; (LiDAR-based) MegVii-AB3DMOT, CenterPoint, VoxelNext; GT-Tracking and GT.
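
The matching step described in Figure 2 and the distance breakdown in Figure 4 can be illustrated with a small sketch. This is not the MFEval implementation: it assumes a simple greedy nearest-neighbor assignment on current agent positions, and the function names, the 2-meter matching threshold, and the distance bin edges are all hypothetical choices made for illustration.

```python
import bisect
import math


def match_predictions_to_gt(pred_positions, gt_positions, max_dist=2.0):
    """Greedily pair predicted tracks with GT agents by current position.

    pred_positions: dict mapping predicted track id -> (x, y)
    gt_positions:   dict mapping GT agent id -> (x, y)
    max_dist:       hypothetical matching threshold in meters (assumption)

    Returns (matches, unmatched_pred, unmatched_gt), where matches maps
    predicted id -> GT id.
    """
    # Collect every candidate pair that is close enough to be a match.
    candidates = []
    for pid, (px, py) in pred_positions.items():
        for gid, (gx, gy) in gt_positions.items():
            dist = math.hypot(px - gx, py - gy)
            if dist <= max_dist:
                candidates.append((dist, pid, gid))

    # Accept the closest pairs first; each id is used at most once.
    candidates.sort(key=lambda c: c[0])
    matches, used_gt = {}, set()
    for _, pid, gid in candidates:
        if pid in matches or gid in used_gt:
            continue
        matches[pid] = gid
        used_gt.add(gid)

    unmatched_pred = sorted(set(pred_positions) - set(matches))  # false detections
    unmatched_gt = sorted(set(gt_positions) - used_gt)           # misdetections
    return matches, unmatched_pred, unmatched_gt


def matched_gt_by_distance(matches, gt_positions, ego_xy=(0.0, 0.0),
                           bin_edges=(10.0, 20.0, 30.0)):
    """Count (matched, total) GT agents per agent-ego distance bin.

    bin_edges are hypothetical bin boundaries in meters; bin i covers
    [bin_edges[i-1], bin_edges[i]) and the last bin is open-ended.
    """
    matched_gt = set(matches.values())
    counts = {}
    for gid, (gx, gy) in gt_positions.items():
        dist = math.hypot(gx - ego_xy[0], gy - ego_xy[1])
        bin_idx = bisect.bisect_right(bin_edges, dist)
        matched, total = counts.get(bin_idx, (0, 0))
        counts[bin_idx] = (matched + (gid in matched_gt), total + 1)
    return counts


if __name__ == "__main__":
    pred = {"t7": (0.5, 0.0), "t9": (10.0, 10.0), "t3": (30.0, 0.0)}
    gt = {"a": (0.0, 0.0), "b": (10.0, 11.0), "c": (50.0, 50.0)}
    matches, extra_pred, missed_gt = match_predictions_to_gt(pred, gt)
    print(matches, extra_pred, missed_gt)
    print(matched_gt_by_distance(matches, gt))
```

Once predicted tracks carry GT identities, forecasts from any perception stack can be scored against the same annotations, and unmatched GT agents or unmatched predictions can be counted as misdetections or false detections. A globally optimal assignment (for example, the Hungarian algorithm) could replace the greedy loop; the greedy variant is used here only to keep the sketch dependency-free.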