Valeo4Cast: A Modular Approach to End-to-End Forecasting

Yihong Xu, Éloi Zablocki, Alexandre Boulch, Gilles Puy, Mickael Chen, Florent Bartoccioni, Nermin Samet, Oriane Siméoni, Spyros Gidaris, Tuan-Hung Vu, Andrei Bursuc, Eduardo Valle, Renaud Marlet, Matthieu Cord

TL;DR

This work uses a modular paradigm with dedicated finetuning strategies; it significantly outperforms end-to-end-trained counterparts and ranks first in the Argoverse 2 End-to-end Forecasting Challenge.

Abstract

Motion forecasting is crucial in autonomous driving systems to anticipate the future trajectories of surrounding agents such as pedestrians, vehicles, and traffic signals. In end-to-end forecasting, the model must jointly detect and track, from sensor data (cameras or LiDARs), the past trajectories of the different elements of the scene and predict their future locations. We depart from the current trend of tackling this task via end-to-end training from perception to forecasting, and instead use a modular approach. We individually build and train detection, tracking, and forecasting modules. We then use only consecutive finetuning steps to better integrate the modules and alleviate compounding errors. Our in-depth study of the finetuning strategies reveals that this simple yet effective approach significantly improves performance on the end-to-end forecasting benchmark. Consequently, our solution ranks first in the Argoverse 2 End-to-end Forecasting Challenge, with 63.82 mAP$_\text{f}$. In forecasting, we surpass last year's winner by +17.1 points and this year's runner-up by +13.3 points. This performance can be explained by our modular paradigm, which integrates finetuning strategies and significantly outperforms its end-to-end-trained counterparts. The code, model weights, and results are available at https://github.com/valeoai/valeo4cast.

Paper Structure

This paper contains 31 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the modular approach of Valeo4Cast. Conventional motion forecasting benchmarks provide curated annotations of past trajectories. In contrast, in this 'end-to-end forecasting' challenge, we opt for a modular approach where the past trajectories are predicted by the detection and tracking modules. The predicted results contain imperfections such as false positives (FPs), false negatives (FNs), identity switches (IDS), and localization errors, which hinder forecasting. For this reason, training only on curated data is not sufficient (top). We thus propose a finetuning strategy in which we match the predicted results with the ground-truth annotations. We finetune the model on the matched pairs (middle), which yields significant improvements once the model is deployed in real-world end-to-end forecasting (bottom). The ego car, vehicles, and pedestrians are shown in different colors. Past trajectories are shown with dotted lines and future ones with solid lines. 'Pretrain' refers to pretraining with the UniTraj framework (feng2024unitraj), and 'Train' to the step where we continue training on the curated Argoverse2-Sensor dataset.
  • Figure 2: Per-class performance comparison of Valeo4Cast with and without pretraining. We show the per-class mAP$_\text{f}$ for the 26 classes of the Argoverse 2 Sensor dataset on end-to-end forecasting. Scores are reported on the validation set, and the evaluation is conducted within a 50 m range around the ego car.
  • Figure 3: Qualitative visualizations of randomly sampled frames from the validation set. The ego car, vehicles, wheeled devices, pedestrians, and ground-truth annotations are shown in different colors. The numbers are the detection scores.
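
The Figure 1 caption describes matching predicted detection-and-tracking results with ground-truth annotations before finetuning the forecasting module on the matched pairs. The page does not say how the matching is done, so the following is only a minimal sketch under stated assumptions: it pairs tracks and annotations greedily by 2D center distance at a single timestamp. The function name `match_tracks_to_ground_truth`, the identifiers, and the 2 m threshold are all hypothetical and are not taken from the paper or its code release.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def match_tracks_to_ground_truth(
    predicted: Dict[str, Point],
    ground_truth: Dict[str, Point],
    max_distance_m: float = 2.0,  # hypothetical threshold, not from the paper
) -> List[Tuple[str, str]]:
    """Greedily pair predicted tracks with ground-truth annotations.

    Assumption: matching uses the 2D center distance at one timestamp.
    Unmatched predictions act as false positives and unmatched annotations
    as false negatives; neither yields a finetuning pair.
    """
    # Collect every candidate pair within the threshold, closest first.
    candidates = sorted(
        (math.dist(p_xy, g_xy), p_id, g_id)
        for p_id, p_xy in predicted.items()
        for g_id, g_xy in ground_truth.items()
        if math.dist(p_xy, g_xy) <= max_distance_m
    )

    used_pred, used_gt = set(), set()
    pairs: List[Tuple[str, str]] = []
    for _, p_id, g_id in candidates:
        # Each track and each annotation may be used at most once.
        if p_id in used_pred or g_id in used_gt:
            continue
        used_pred.add(p_id)
        used_gt.add(g_id)
        pairs.append((p_id, g_id))
    return pairs


# Toy scene: two good tracks, one spurious track, one missed annotation.
predicted = {"trk_0": (10.2, 4.9), "trk_1": (30.0, -2.0), "trk_2": (80.0, 80.0)}
ground_truth = {"gt_a": (10.0, 5.0), "gt_b": (29.5, -1.8), "gt_c": (-15.0, 3.0)}

pairs = match_tracks_to_ground_truth(predicted, ground_truth)
print(pairs)  # [('trk_0', 'gt_a'), ('trk_1', 'gt_b')]
```

In this toy scene, `trk_2` has no annotation within range and `gt_c` has no nearby track, so neither contributes a pair. In a finetuning setup of the kind the caption describes, each matched pair would supply a noisy predicted past trajectory as input and the annotated future trajectory as the target. A globally optimal assignment (for example, Hungarian matching) could replace the greedy loop; which variant the authors use is not stated here.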