All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Mahlagha Fazeli, Abolfazl Razi

TL;DR

This survey addresses the core challenge of object detection in autonomous driving by reviewing sensor technologies, data infrastructure, and detection methodologies in a multimodal context. It foregrounds transformer-based and foundation-model approaches (LLMs and VLMs) and presents a taxonomy that organizes AV datasets into ego-vehicle, roadside, and cooperative categories, including V2X platforms. The paper's contributions include a structured synthesis of 2D/3D detection, sensor fusion strategies, and emerging LLM/VLM methods, along with a forward-looking roadmap that highlights context-aware fusion, cooperative perception, and simulation-to-reality adaptation. The work aims to guide researchers and practitioners toward robust, scalable, and interpretable AV perception systems capable of operating safely in diverse real-world environments.

Abstract

Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability: reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge, as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Finally, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), LLMs, Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.

Paper Structure

This paper contains 42 sections, 17 figures, and 23 tables.

Figures (17)

  • Figure 1: Visualization of object detection across multiple sensor modalities in autonomous vehicles. The RGB image demonstrates 2D detection with colored bounding boxes for cars, pedestrians, and cyclists. The LiDAR and Radar point clouds showcase 3D detection through spatially aligned boxes. The integration of multimodal AI, including LLMs and VLMs, supports contextual understanding and enhances detection accuracy by fusing visual and spatial cues from diverse sensor inputs.
  • Figure 2: The organization of this survey paper.
  • Figure 3: Overview of major sensors used in AVs based on their types and perception performance. Sensor performance is evaluated on a scale from 1 to 5, where 1=Very Low, 2=Low, 3=Medium, 4=High, and 5=Very High.
  • Figure 4: Overview of major AV datasets based on their specifications and applications.
  • Figure 5: A comprehensive taxonomy of object detection methods in AVs, categorized into four primary types. Each category includes representative subtypes based on sensor configuration, data representation, fusion strategy, and model architecture.
  • ...and 12 more figures