Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent

Linfeng He, Yiming Sun, Sihao Wu, Jiaxu Liu, Xiaowei Huang

TL;DR

This paper extends the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network, addressing limitations in object detection and localisation, and introduces camera ID-separators to improve multi-view processing.

Abstract

In this paper, we propose a novel framework for enhancing visual comprehension in autonomous driving systems by integrating visual language models (VLMs) with an additional visual perception module specialised in object detection. We extend the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network, addressing limitations in object detection and localisation. Our approach introduces camera ID-separators to improve multi-view processing, which is crucial for comprehensive environmental awareness. Experiments on the DriveLM visual question answering challenge demonstrate significant improvements over baseline models, with higher ChatGPT, BLEU, and CIDEr scores, indicating that the model's answers are closer to the ground truth. Our method represents a promising step towards more capable and interpretable autonomous driving systems. We also discuss possible safety enhancements enabled by the detection modality.

Paper Structure

This paper contains 16 sections, 8 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Schematic of our VLM autonomous driving framework. For each QA pair, the Projection Network and the Detection Network each process the BEV view camera images. Their output tokens are then merged into the hidden state of each decoder layer in the language model before it is passed to the next layer.
  • Figure 2: Detection network for detector query generation. Each image is processed into tokens separately, and the resulting tokens are then concatenated with trainable ID-separator tokens. $f_{\mathrm{proj}}$ is implemented with the Detector Adaptation Transformer Encoder $\mathbb{R}^{M \times d_{\mathbf{yolos}}} \to \mathbb{R}^{M \times d_{\mathbf{yolos}}}$ followed by the Projection Network $\mathbb{R}^{M \times d_{\mathbf{yolos}}} \to \mathbb{R}^{M \times d_{\mathbf{emb}}}$.
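
The shape flow described in the Figure 2 caption can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several points are assumptions made only for illustration: the function names, the dimensions (6 cameras, 4 tokens per camera, d_yolos = 8, d_emb = 16), the placement of each ID-separator directly before its camera's tokens, and the choice to count separator tokens in M. The Detector Adaptation Transformer Encoder is replaced by a shape-preserving placeholder, and the Projection Network by a single bias-free linear map.

```python
import random


def concat_with_id_separators(per_camera_tokens, separators):
    """Join per-camera token lists into one sequence of length M.

    Assumption: each camera's ID-separator token is placed directly
    before that camera's tokens; the caption does not fix the position.
    """
    if len(per_camera_tokens) != len(separators):
        raise ValueError("expected one ID-separator per camera")
    sequence = []
    for separator, tokens in zip(separators, per_camera_tokens):
        sequence.append(separator)
        sequence.extend(tokens)
    return sequence


def adapt(sequence):
    """Placeholder for the Detector Adaptation Transformer Encoder.

    It only preserves the (M, d_yolos) shape; a real encoder would mix
    information across tokens with self-attention.
    """
    return [list(token) for token in sequence]


def project(sequence, weight):
    """Stand-in for the Projection Network: (M, d_yolos) -> (M, d_emb).

    `weight` holds d_emb rows of length d_yolos (hypothetical, no bias).
    """
    return [
        [sum(x * w for x, w in zip(token, row)) for row in weight]
        for token in sequence
    ]


# Illustrative sizes only; none of these values come from the paper.
NUM_CAMERAS, TOKENS_PER_CAMERA, D_YOLOS, D_EMB = 6, 4, 8, 16
rng = random.Random(0)


def random_vector(dim):
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# Random stand-ins for detector tokens and trainable separator tokens.
per_camera_tokens = [
    [random_vector(D_YOLOS) for _ in range(TOKENS_PER_CAMERA)]
    for _ in range(NUM_CAMERAS)
]
separators = [random_vector(D_YOLOS) for _ in range(NUM_CAMERAS)]
weight = [random_vector(D_YOLOS) for _ in range(D_EMB)]

sequence = concat_with_id_separators(per_camera_tokens, separators)
queries = project(adapt(sequence), weight)
print(len(sequence), len(queries), len(queries[0]))  # 30 30 16
```

Under these assumptions M = 6 × (4 + 1) = 30, and only the final projection changes the token width, from d_yolos to d_emb.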