Unlock the Power of Unlabeled Data in Language Driving Model

Chaoqun Wang; Jie Yang; Xiaobin Hong; Ruimao Zhang

TL;DR

This work tackles the data annotation bottleneck in vision-language driving models by introducing an iterative semi-supervised framework that leverages abundant unlabeled driving scenes. It combines three components: template-based prompts that generate diverse VQA pairs, scene-graph construction that captures object relations, and a Self-Consistency Refinement module that assigns reliability scores to pseudo-labels. Together these enable progressive training of a Language Driving Model (LDM) built on a pre-trained VisionLLM such as InternVL2. On DriveLM, the approach achieves a final score of $44.85\%$ with only $5\%$ labeled data and improves to $54.27\%$ when unlabeled data is added, approaching the full-data score of $60.68\%$; validation results show a consistent gain of +$9.26\%$. This demonstrates that unlabeled driving data can substantially improve VisionLLMs for driving scene question-answering, reducing annotation cost while retaining much of the performance of full supervision.

Abstract

Recent Vision-based Large Language Models (VisionLLMs) for autonomous driving have advanced rapidly. However, this progress depends heavily on large-scale, high-quality annotated data, which is costly and labor-intensive to produce. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language driving model in a semi-supervised manner. Specifically, we first introduce a series of template-based prompts to extract scene information and generate questions; a model trained on limited labeled data then produces pseudo-answers to these questions for the unlabeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are then used for further training. Building on a pre-trained VisionLLM (e.g., InternVL), we construct a strong Language Driving Model (LDM) for driving scene question-answering that outperforms previous state-of-the-art methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained on the full dataset. In particular, our LDM achieves 44.85% with limited labeled data, rising to 54.27% when unlabeled data is used, while models trained on the full dataset reach 60.68%.
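
The abstract does not spell out how Self-Consistency Refinement scores a pseudo-annotation. As a rough illustration of the general idea only, the sketch below assumes one common form of self-consistency: sample several answers to the same question, take the majority answer, and use the agreement ratio as a reliability score. The names `sample_answer`, `self_consistency_label`, `filter_pseudo_labels`, and the `threshold` value are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency_label(
    sample_answer: Callable[[str], str],
    question: str,
    num_samples: int = 5,
) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it.

    `sample_answer` is a hypothetical stand-in for a stochastic call to a
    model trained on the labeled subset.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    # Normalize lightly so trivial formatting differences do not split votes.
    answers = [sample_answer(question).strip().lower() for _ in range(num_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / num_samples


def filter_pseudo_labels(
    sample_answer: Callable[[str], str],
    questions: List[str],
    threshold: float = 0.6,  # assumed cutoff, not a value from the paper
    num_samples: int = 5,
) -> List[Tuple[str, str, float]]:
    """Keep (question, pseudo_answer, score) triples whose score meets the threshold."""
    kept = []
    for question in questions:
        answer, score = self_consistency_label(sample_answer, question, num_samples)
        if score >= threshold:
            kept.append((question, answer, score))
    return kept
```

In an iterative setup, the retained triples would be merged with the labeled set for the next training round; whether the paper filters by a hard threshold or weights samples by their score is not stated in the abstract.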
