TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving
Hossein Hassani, Soodeh Nikan, Abdallah Shami
TL;DR
This work targets the high computational cost of Vision-Language Models for visual question answering in autonomous driving. It introduces TinyDrive, a compact VLM that combines a multiscale CNN vision encoder with scale injection and cross-scale gating, a token routing mechanism to prune text tokens, and a sequence priority buffer to bias training toward informative samples. Empirical results on the Rosmaster dataset and the DriveLM-nuScenes benchmark show TinyDrive achieves state-of-the-art language understanding with substantially fewer parameters and FLOPs, including notable gains in BLEU-4 and METEOR scores. The approach is aimed at practical, efficient VQA in resource-constrained autonomous vehicles.
Abstract
Vision Language Models (VLMs) employed for visual question answering (VQA) in autonomous driving often require substantial computational resources, which makes them difficult to deploy in resource-constrained vehicles. To address this challenge, we introduce TinyDrive, a lightweight yet effective VLM for multi-view VQA in driving scenarios. Our model comprises two key components: a multiscale vision encoder and a dual-level prioritization mechanism for tokens and sequences. The multiscale encoder processes multi-view images at diverse resolutions through scale injection and cross-scale gating to generate enhanced visual representations. At the token level, we design a token routing mechanism that dynamically selects and processes the most informative tokens based on learned importance scores. At the sequence level, we propose integrating normalized loss, uncertainty estimates, and a diversity metric to formulate sequence scores that rank and preserve samples within a sequence priority buffer. Samples with higher scores are selected more frequently for training. TinyDrive is first evaluated on our custom-curated VQA dataset and subsequently tested on the public DriveLM benchmark, where it achieves state-of-the-art language understanding performance. Notably, it achieves relative improvements of 11.1% and 35.4% in BLEU-4 and METEOR scores, respectively, despite having a significantly smaller parameter count.
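The dual-level prioritization described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: token routing is approximated by keeping a fixed top fraction of tokens ranked by a given importance score, the sequence score is taken to be a weighted sum of the three components, and the buffer samples in proportion to score. The names route_tokens, sequence_score, SequencePriorityBuffer, keep_ratio, and weights are all hypothetical.

```python
import random


def route_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of tokens, preserving their order.

    Assumption: a fixed keep_ratio stands in for the learned routing rule.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank indices by importance score, keep the top k, restore sequence order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept]


def sequence_score(norm_loss, uncertainty, diversity, weights=(1.0, 1.0, 1.0)):
    """Combine the three signals into one score.

    Assumption: a weighted sum; the abstract does not give the exact formula.
    """
    w_loss, w_unc, w_div = weights
    return w_loss * norm_loss + w_unc * uncertainty + w_div * diversity


class SequencePriorityBuffer:
    """Fixed-capacity buffer that retains the highest-scoring samples."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # list of (score, sample), highest score first

    def add(self, sample, score):
        self.items.append((score, sample))
        # Rank by score and drop everything beyond the capacity.
        self.items.sort(key=lambda pair: pair[0], reverse=True)
        del self.items[self.capacity:]

    def sample(self, n, rng=random):
        """Draw n samples with replacement, proportionally to their scores.

        Assumes all scores are non-negative and at least one is positive.
        """
        scores = [score for score, _ in self.items]
        samples = [sample for _, sample in self.items]
        return rng.choices(samples, weights=scores, k=n)
```

In this sketch, a sample with a high normalized loss, high uncertainty, or high diversity both survives longer in the buffer and is drawn more often, which mirrors the statement that higher-scoring samples are selected more frequently for training.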
