DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models

Xirui Zhou, Lianlei Shan, Xiaolin Gui

TL;DR

The paper tackles the loss of fine-grained visual detail caused by downsampling in vision-language models for autonomous driving perception. It introduces DynRsl-VLM, which builds dynamic-resolution image inputs from regions of interest (ROIs) and merged region representations, and pairs them with a dedicated DynRsl image-text alignment module that replaces the Q-Former. The method uses multi-view ViT features and a suite of pretraining objectives, including symmetric InfoNCE, image-grounded text generation (ITG), and image-text matching (ITM) with hard negatives, to align multi-resolution visual features with text. Experiments on NuInstruct show consistent gains on perception, prediction, risk assessment, and planning tasks, indicating improved environmental understanding under realistic computational constraints. The work offers a practical route to robust multimodal perception in autonomous driving.

Abstract

Visual Question Answering (VQA) models, which fall under the category of vision-language models, conventionally execute multiple downsampling processes on image inputs to strike a balance between computational efficiency and model performance. Although this approach aids in concentrating on salient features and diminishing computational burden, it incurs the loss of vital detailed information, a drawback that is particularly damaging in end-to-end autonomous driving scenarios. Downsampling can lead to an inadequate capture of distant or small objects such as pedestrians, road signs, or obstacles, all of which are crucial for safe navigation. This loss of features negatively impacts an autonomous driving system's capacity to accurately perceive the environment, potentially escalating the risk of accidents. To tackle this problem, we put forward the Dynamic Resolution Vision Language Model (DynRsl-VLM). DynRsl-VLM incorporates a dynamic resolution image input processing approach that captures all entity feature information within an image while ensuring that the image input remains computationally tractable for the Vision Transformer (ViT). Moreover, we devise a novel image-text alignment module to replace the Q-Former, enabling simple and efficient alignment with text when dealing with dynamic resolution image inputs. Our method enhances the environmental perception capabilities of autonomous driving systems without overstepping computational constraints.
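The dynamic-resolution input idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`downsample`, `merge_overlapping`, `build_dynrsl_inputs`), the average-pooling downsampler, and the rule that ROIs are combined when their boxes overlap are all assumptions made for illustration. The sketch keeps a low-resolution view of the whole image and crops each region from the original pixels so that small objects retain their detail.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive
Image = List[List[float]]        # grayscale image as rows of pixel values


def downsample(image: Image, factor: int) -> Image:
    """Average-pool by `factor` to get a cheap global view (assumed method)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(0, h - h % factor, factor):
        row = []
        for x in range(0, w - w % factor, factor):
            block = [image[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def crop(image: Image, box: Box) -> Image:
    """Cut a region out of the original image at full resolution."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def overlaps(a: Box, b: Box) -> bool:
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_overlapping(boxes: List[Box]) -> List[Box]:
    """Union boxes that overlap (hypothetical rule for combined regions)."""
    merged: List[Box] = []
    for box in boxes:
        changed = True
        while changed:
            changed = False
            for other in merged:
                if overlaps(box, other):
                    merged.remove(other)
                    box = (min(box[0], other[0]), min(box[1], other[1]),
                           max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break  # restart the scan with the enlarged box
        merged.append(box)
    return merged


def build_dynrsl_inputs(image: Image, rois: List[Box], factor: int):
    """Return (global low-res view, entity crops, combined-region crops)."""
    combined = [b for b in merge_overlapping(rois) if b not in rois]
    return (downsample(image, factor),
            [crop(image, b) for b in rois],
            [crop(image, b) for b in combined])


# Usage: an 8x8 toy image with two overlapping ROIs and one isolated ROI.
image = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
rois = [(0, 0, 2, 2), (1, 1, 3, 3), (6, 6, 8, 8)]
global_view, entity_crops, combined_crops = build_dynrsl_inputs(image, rois, 4)
```

In this toy setup the global view shrinks to 2x2 while every crop keeps its original pixels, which is the trade-off the abstract describes: the total input stays small, but detail inside the regions is not lost to downsampling.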

Paper Structure

This paper contains 17 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The architecture of our model that acquires multi-resolution images, performs visual-text alignment, and conducts efficient computations.
  • Figure 2: Method for obtaining Region Images. This diagram illustrates the approach for acquiring Region Images, which include both individual entity regions and combined regions. The blue solid-line boxes represent the ROIs, while the yellow dashed-line boxes denote the combined region.
  • Figure 3: Method for obtaining DynRsl image inputs. This diagram shows how DynRsl image inputs are derived from the original high-resolution image, where the overall image resolution is reduced while maintaining the clarity of Region Images.
  • Figure 4: Architecture of the alignment module and the losses employed during model training. First, we extract features from the image input using the frozen ViT, and then apply attention between the image features at different resolutions. The resulting features are our DynRsl image features, which are aligned with the text features. Because multiple image features are matched to a single text feature, the alignment is finer-grained; it is trained with three distinct loss functions.
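
As a rough illustration of the alignment described in the Figure 4 caption, the sketch below computes a symmetric InfoNCE loss in which each sample contributes several image features (one per resolution) and a single text feature. The max-pooling of similarities over a sample's image features, the temperature of 0.07, and all function names are assumptions for illustration; this section does not specify how the paper aggregates the features, and the ITG and ITM losses are not shown.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(image_feats: List[List[Vector]],
                      text_feats: List[Vector]) -> List[List[float]]:
    """sim[i][j] = best match between any image feature of sample i and text j.

    Taking the max over a sample's image features is an assumed pooling rule.
    """
    return [[max(cosine(f, t) for f in feats) for t in text_feats]
            for feats in image_feats]


def cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(image_feats: List[List[Vector]],
                       text_feats: List[Vector],
                       temperature: float = 0.07) -> float:
    """Average of the image-to-text and text-to-image InfoNCE terms."""
    sim = similarity_matrix(image_feats, text_feats)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-image
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Usage: two samples, each with two image features and one text feature.
image_feats = [[(1.0, 0.0), (0.9, 0.1)], [(0.0, 1.0), (0.1, 0.9)]]
text_feats = [(1.0, 0.0), (0.0, 1.0)]
aligned_loss = symmetric_info_nce(image_feats, text_feats)
shuffled_loss = symmetric_info_nce(image_feats, list(reversed(text_feats)))
```

With correctly paired text the loss is close to zero, and it grows when the text features are shuffled, which is the behaviour a contrastive alignment objective is meant to have.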