Joint Perception and Prediction for Autonomous Driving: A Survey

Lucas Dal'Col; Miguel Oliveira; Vítor Santos

TL;DR

Autonomous driving systems must both perceive the environment and predict the future behavior of surrounding agents. This paper surveys the joint perception and prediction paradigm, introduces a taxonomy based on input representation, scene context, and output representation, and provides qualitative and quantitative analyses of 55 methods. It traces the evolution from bird's-eye-view (BEV) and range-view inputs to multi-representation fusion, explicit interaction modeling, and occupancy-based outputs, with a focus on evaluations on nuScenes and related datasets. The survey identifies open gaps, such as radar utilization, inter-class interactions, uncertainty modeling, and unified evaluation metrics, and outlines directions for future research along with practical implications for real-time autonomous driving. Overall, the work serves as a roadmap for researchers designing more integrated, efficient, and robust joint perception and prediction systems.

Abstract

Perception and prediction modules are critical components of autonomous driving systems, enabling vehicles to navigate safely through complex environments. The perception module is responsible for understanding the environment, including static and dynamic objects, while the prediction module forecasts the future behavior of these objects. Together, these modules are typically divided into three tasks: object detection, object tracking, and motion prediction. Traditionally, these tasks are developed and optimized independently, with the output of each task passed sequentially to the next. However, this approach has significant limitations: computational resources are not shared across tasks, the lack of joint optimization can amplify errors as they propagate through the pipeline, and uncertainty is rarely propagated between modules, resulting in significant information loss. To address these challenges, the joint perception and prediction paradigm has emerged, integrating perception and prediction into a unified model through multi-task learning. This strategy not only addresses the limitations of the sequential approach but also gives all three tasks direct access to raw sensor data, enabling richer and more nuanced interpretations of the environment. This paper presents the first comprehensive survey of joint perception and prediction for autonomous driving. We propose a taxonomy that categorizes approaches by input representation, scene context modeling, and output representation, highlighting their contributions and limitations. Additionally, we present a qualitative analysis and a quantitative comparison of existing methods. Finally, we discuss future research directions based on gaps identified in the state of the art.
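The contrast between a sequential pipeline and a jointly optimized model can be illustrated with a deliberately tiny sketch. It assumes a scalar "sensor feature" x in place of raw sensor data and linear task heads. A shared encoder weight produces a latent feature once, three hypothetical heads (detection, tracking, prediction) read that same feature, and training minimizes a weighted joint objective L = Σ_k λ_k L_k. The names `JointModel`, `joint_loss`, `train_step`, the equal weights λ_k, and the synthetic targets are illustrative assumptions, not a method from the survey.

```python
"""Toy sketch of joint (multi-task) perception-and-prediction training.

Illustrative assumptions only: a scalar input x stands in for raw sensor
data, one shared encoder weight is reused by all task heads, and every
head is linear. None of these names come from the surveyed methods.
"""
from dataclasses import dataclass, field

TASKS = ("detection", "tracking", "prediction")


@dataclass
class JointModel:
    # Shared encoder weight: its output is computed once and reused by all heads.
    w_shared: float = 0.5
    # One hypothetical linear head per task.
    heads: dict = field(default_factory=lambda: {t: 0.5 for t in TASKS})

    def forward(self, x):
        z = self.w_shared * x  # shared latent feature
        return z, {t: w * z for t, w in self.heads.items()}


def joint_loss(model, data, weights):
    """Joint objective L = sum_k lambda_k * MSE_k, averaged over samples."""
    total = 0.0
    for x, targets in data:
        _, preds = model.forward(x)
        for t in TASKS:
            total += weights[t] * (preds[t] - targets[t]) ** 2
    return total / len(data)


def train_step(model, data, weights, lr=0.05):
    """One gradient-descent step on the joint loss, with hand-derived gradients."""
    g_shared = 0.0
    g_heads = {t: 0.0 for t in TASKS}
    for x, targets in data:
        z, preds = model.forward(x)
        for t in TASKS:
            d = 2.0 * weights[t] * (preds[t] - targets[t])  # dL/d(pred_t)
            g_heads[t] += d * z
            # Every task head back-propagates into the shared encoder.
            g_shared += d * model.heads[t] * x
    n = len(data)
    # All gradients are computed before any parameter is updated.
    model.w_shared -= lr * g_shared / n
    for t in TASKS:
        model.heads[t] -= lr * g_heads[t] / n


# Hypothetical synthetic data: each task target is a different linear function of x.
TRUE_SLOPES = {"detection": 2.0, "tracking": 1.0, "prediction": -1.0}
DATA = [(x, {t: s * x for t, s in TRUE_SLOPES.items()}) for x in (-1.0, -0.5, 0.5, 1.0)]
WEIGHTS = {t: 1.0 for t in TASKS}  # equal task weights lambda_k (assumed)

if __name__ == "__main__":
    model = JointModel()
    print(f"initial joint loss: {joint_loss(model, DATA, WEIGHTS):.4f}")
    for _ in range(2000):
        train_step(model, DATA, WEIGHTS)
    print(f"final joint loss:   {joint_loss(model, DATA, WEIGHTS):.6f}")
```

Because `train_step` accumulates gradient into `w_shared` from all three heads, errors in the downstream prediction task directly shape the shared features. In a sequential pipeline, each stage would instead be fitted to its own target with upstream weights frozen. The methods covered by the survey use far richer inputs, such as BEV or range-view representations, in place of this scalar.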


Figures (7)