Tracking Meets Large Multimodal Models for Driving Scenario Understanding

Ayesha Ishaq, Jean Lahoud, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer

TL;DR

The paper addresses a limitation of Large Multimodal Models (LMMs) for autonomous driving: they rely mainly on image data and underutilize 3D spatial and temporal cues. It introduces a trajectory encoder that ingests 3D object tracks and ego motion and fuses the resulting embeddings with visual features through a query former, enabling richer spatiotemporal reasoning. A self-supervised pretraining scheme for the encoder, together with an automated annotation pipeline that generates tracks and question-answer pairs from nuScenes ground truth, further improves the model's understanding of dynamic driving scenarios. On DriveLM-nuScenes, the method reports gains of 9.5% in accuracy, 7.04 points in ChatGPT score, and 9.4% in overall score over baseline models, along with a 3.7% final score improvement on DriveLM-CARLA, while maintaining competitive runtime.

Abstract

Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and a 9.4% increase in the overall score over baseline models on the DriveLM-nuScenes benchmark, along with a 3.7% final score improvement on DriveLM-CARLA. Our code is available at https://github.com/mbzuai-oryx/TrackingMeetsLMM

Paper Structure

This paper contains 24 sections, 9 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Our proposed approach. We integrate image and tracking information to enhance question-answering in the autonomous driving domain. In this example, the visual information is enriched with tracking data to provide crucial context about object movements and interactions over time. The additional tracking information allows the model to better interpret the driving scenario.
  • Figure 2: Addressing the limitations of LMM-based methods for driving scenario understanding. The baseline approach falls short in capturing object movements and interactions. This limitation arises from its reliance solely on image data, which lacks the necessary temporal dimension to understand dynamic environments. In contrast, our proposed method addresses these shortcomings by incorporating tracking information, including object locations and velocities, enabling the model to leverage enhanced spatiotemporal context.
  • Figure 3: Overview of our proposed method. Our system integrates visual, trajectory, and ego-motion information through visual and trajectory encoders. The input consists of multi-view images, which are processed to obtain visual embeddings, while a 3D tracker generates key object tracks. These tracks are then used to extract spatiotemporal embeddings, which are fused with the visual embeddings and transformed through a query former module. These multimodal embeddings are then passed to the large language model enhanced with adapters to align visual, tracking and textual modalities, enabling contextual reasoning and task-specific answers.
  • Figure 4: Architecture of our proposed trajectory encoder. The input tracks are tokenized via linear projection with positional embeddings. A transformer encoder refines these embeddings, capturing spatiotemporal relationships, which are output through an embedding head, preparing trajectory data for multimodal fusion.
  • Figure 5: Overview of the automated annotation pipeline. We use nuScenes [nuscenes2019] ground truth and multi-view frames to generate tracks and question-answer pairs related to track attributes for pretraining our proposed trajectory encoders.
  • ...and 3 more figures
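
The trajectory encoder described in the Figure 4 caption (linear projection with positional embeddings, a transformer encoder, then an embedding head) can be illustrated with a small standard-library sketch. This is not the authors' implementation: the per-timestep features (x, y, z, vx, vy), the sinusoidal positional embeddings, the single attention head, the layer sizes, and the random weights standing in for learned parameters are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def sinusoidal_positions(length, dim):
    """Fixed sinusoidal positional embeddings, one row per timestep (assumption)."""
    return [[math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
             else math.cos(t / 10000 ** ((i - 1) / dim))
             for i in range(dim)] for t in range(length)]

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def encode_track(track, d_model=8, d_out=4, seed=0):
    """Encode one object track (timesteps x features) into per-timestep embeddings."""
    rng = random.Random(seed)
    w_in = random_matrix(len(track[0]), d_model, rng)
    w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))
    w_head = random_matrix(d_model, d_out, rng)

    # 1. Tokenize: linear projection of each timestep plus a positional embedding.
    tokens = matmul(track, w_in)
    positions = sinusoidal_positions(len(track), d_model)
    tokens = [[x + p for x, p in zip(t_row, p_row)]
              for t_row, p_row in zip(tokens, positions)]

    # 2. Refine: one single-head self-attention layer with a residual connection,
    #    a stand-in for the full transformer encoder.
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    attn = [softmax([s / math.sqrt(d_model) for s in row]) for row in scores]
    mixed = matmul(attn, v)
    refined = [[x + m for x, m in zip(t_row, m_row)]
               for t_row, m_row in zip(tokens, mixed)]

    # 3. Embedding head: project to the size expected by the fusion stage.
    return matmul(refined, w_head), attn

# Hypothetical track: three timesteps of (x, y, z, vx, vy) for one object.
track = [[1.0, 2.0, 0.0, 0.5, 0.0],
         [1.5, 2.0, 0.0, 0.5, 0.0],
         [2.0, 2.1, 0.0, 0.5, 0.1]]
embeddings, attention = encode_track(track)
print(len(embeddings), len(embeddings[0]))  # 3 timesteps, 4 output dimensions
```

In the full method these per-timestep embeddings would be fused with visual embeddings through the query former; that stage, along with layer normalization, feed-forward sublayers, and multi-head attention, is omitted here to keep the sketch short.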