Joint Perception and Prediction for Autonomous Driving: A Survey

Lucas Dal'Col, Miguel Oliveira, Vítor Santos

TL;DR

Autonomous driving systems must both perceive the environment and predict future agent behavior. This paper surveys the joint perception and prediction paradigm, introducing a taxonomy based on input representation, scene context, and output representation, and provides qualitative and quantitative analyses of 55 methods. It traces the evolution from BEV and range-view inputs to multi-representation fusion, explicit interaction modeling, and occupancy-based outputs, highlighting evaluations on nuScenes and related datasets. The survey identifies gaps such as radar utilization, inter-class interactions, uncertainty modeling, and unified metrics, offering directions for future research and practical implications for real-time autonomous driving. Overall, the work serves as a roadmap for researchers to design more integrated, efficient, and robust joint perception-prediction systems.

Abstract

Perception and prediction modules are critical components of autonomous driving systems, enabling vehicles to navigate safely through complex environments. The perception module is responsible for perceiving the environment, including static and dynamic objects, while the prediction module is responsible for predicting the future behavior of these objects. These modules are typically divided into three tasks: object detection, object tracking, and motion prediction. Traditionally, these tasks are developed and optimized independently, with outputs passed sequentially from one to the next. However, this approach has significant limitations: computational resources are not shared across tasks, the lack of joint optimization can amplify errors as they propagate through the pipeline, and uncertainty is rarely propagated between modules, resulting in significant information loss. To address these challenges, the joint perception and prediction paradigm has emerged, integrating perception and prediction into a unified model through multi-task learning. This strategy not only overcomes the limitations of previous methods, but also gives the three tasks direct access to raw sensor data, allowing richer and more nuanced environmental interpretations. This paper presents the first comprehensive survey of joint perception and prediction for autonomous driving. We propose a taxonomy that categorizes approaches based on input representation, scene context modeling, and output representation, highlighting their contributions and limitations. Additionally, we present a qualitative analysis and quantitative comparison of existing methods. Finally, we discuss future research directions based on identified gaps in the state of the art.

Paper Structure

This paper contains 24 sections, 5 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The proposed taxonomy of joint perception and prediction for autonomous driving.
  • Figure 2: Illustration of the input representations used in joint perception and prediction for autonomous driving: (a) bird's-eye-view, (b) multi-view images, (c) range-view, and (d) 3D voxel grid. Multi-representation is not depicted in this figure, as it simply involves using two or more of these representations. Figure created based on Wu2020, Zhang2022, Meyer2021, Tong2023, Tian2023.
  • Figure 3: Chronological overview of the joint perception and prediction approaches according to the input representation level of the taxonomy.
  • Figure 4: Illustration of the types of scene context modeling: (a) map modeling, (b) interaction modeling, and (c) trajectory modeling. Figure created based on Caesar2020, Li2020, Cui2021.
  • Figure 5: Chronological overview of the joint perception and prediction approaches according to the scene context level of the taxonomy: (a) map modeling, (b) interaction modeling, and (c) trajectory modeling.
  • ...and 2 more figures