TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving

Hossein Hassani, Soodeh Nikan, Abdallah Shami

TL;DR

This work targets the high computational cost of Vision-Language Models for visual question answering in autonomous driving. It introduces TinyDrive, a compact VLM that combines a multiscale CNN vision encoder with scale injection and cross-scale gating, a token routing mechanism to prune text tokens, and a sequence priority buffer to bias training toward informative samples. Empirical results on the Rosmaster dataset and the DriveLM-nuScenes benchmark show TinyDrive achieves state-of-the-art language understanding with substantially fewer parameters and FLOPs, including notable gains in BLEU-4 and METEOR scores. The approach enables practical, efficient VQA in resource-constrained autonomous vehicles, advancing real-time perception-language reasoning capabilities.

Abstract

Vision Language Models (VLMs) employed for visual question-answering (VQA) in autonomous driving often require substantial computational resources, which makes them difficult to deploy in resource-constrained vehicles. To address this challenge, we introduce TinyDrive, a lightweight yet effective VLM for multi-view VQA in driving scenarios. Our model comprises two key components: a multiscale vision encoder and a dual-level prioritization mechanism for tokens and sequences. The multiscale encoder processes multi-view images at diverse resolutions through scale injection and cross-scale gating to generate enhanced visual representations. At the token level, we design a token routing mechanism that dynamically selects and processes the most informative tokens based on learned importance scores. At the sequence level, we propose integrating normalized loss, uncertainty estimates, and a diversity metric to formulate sequence scores that rank and preserve samples within a sequence priority buffer. Samples with higher scores are more frequently selected for training. TinyDrive is first evaluated on our custom-curated VQA dataset, and it is subsequently tested on the public DriveLM benchmark, where it achieves state-of-the-art language understanding performance. Notably, it achieves relative improvements of 11.1% and 35.4% in BLEU-4 and METEOR scores, respectively, despite having a significantly smaller parameter count.
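
The sequence-level prioritization described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula or sampling rule, so everything concrete here is an assumption made for illustration: the weighted-sum form of `sequence_score`, the equal default weights, the eviction policy, and the score-proportional sampling. The names `sequence_score` and `SequencePriorityBuffer` are hypothetical and are not taken from the paper.

```python
import random


def sequence_score(norm_loss, uncertainty, diversity, weights=(1.0, 1.0, 1.0)):
    """Combine the three signals named in the abstract into one score.

    ASSUMPTION: a plain weighted sum; the paper's actual formula may differ.
    """
    w_loss, w_unc, w_div = weights
    return w_loss * norm_loss + w_unc * uncertainty + w_div * diversity


class SequencePriorityBuffer:
    """Fixed-capacity buffer that ranks samples by score (illustrative only)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []  # list of (score, sample), highest score first

    def add(self, sample, score):
        self.items.append((score, sample))
        # Rank by score only, then evict the lowest-scoring overflow entries.
        self.items.sort(key=lambda pair: pair[0], reverse=True)
        del self.items[self.capacity:]

    def sample(self, k, rng=random):
        # ASSUMPTION: draw with replacement, probability proportional to score,
        # so higher-scoring samples are selected more often for training.
        samples = [s for _, s in self.items]
        weights = [max(score, 0.0) + 1e-8 for score, _ in self.items]
        return rng.choices(samples, weights=weights, k=k)
```

In this sketch, a training loop would call `add` after each forward pass with the freshly computed score and then call `sample` to build the next batch. Whether the original work samples with replacement or re-scores buffered samples over time is not stated in the abstract.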

Paper Structure

This paper contains 29 sections, 11 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of TinyDrive against SOTA models on the DriveLM benchmark, evaluating average language understanding performance against (a) total trainable parameters and (b) total FLOPs. TinyDrive outperforms SOTA models while requiring substantially fewer computational resources.
  • Figure 2: Overall architecture of TinyDrive.
  • Figure 3: Sample generated answers by TinyDrive$_{v_{12}}$ for VQA on the Rosmaster dataset.
  • Figure 4: Attention maps generated by the high-, mid-, and low-resolution branches for two classes: (a) turn right and (b) yellow light. In each sub-figure, the top shows the maps before fine-tuning the classification head and the bottom shows the maps after fine-tuning.
  • Figure 5: Sample images from the Rosmaster dataset. Images are captured using an RGB camera mounted on the self-driving car navigating through a driving map within a laboratory.