Unlock the Power of Unlabeled Data in Language Driving Model

Chaoqun Wang, Jie Yang, Xiaobin Hong, Ruimao Zhang

TL;DR

This work tackles the data annotation bottleneck in vision-language driving models by introducing an iterative semi-supervised framework that leverages abundant unlabeled driving scenes. It combines template-based prompts that generate diverse VQA pairs, scene-graph construction that captures object relations, and a Self-Consistency Refinement module that assigns reliability scores to pseudo-labels, enabling progressive training of a Language Driving Model (LDM) built on an InternVL2-based architecture. Quantitatively, the approach achieves a final score of $44.85\%$ with only $5\%$ labeled data and improves to $54.27\%$ when unlabeled data is used, approaching the full-data score of $60.68\%$ on DriveLM; validation shows a consistent gain of $+9.26\%$. This demonstrates that unlabeled driving data can substantially scale VisionLLMs for driving scene question-answering, reducing annotation costs while retaining most of the performance of full supervision.

Abstract

Recent Vision-based Large Language Models (VisionLLMs) for autonomous driving have advanced rapidly. However, this progress depends heavily on large-scale, high-quality annotated data, which is costly and labor-intensive to produce. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner. Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions whose pseudo-answers for the unlabeled data are produced by a model trained with limited labeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-the-art methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets. In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark.
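The loop described in the abstract (supervised fine-tuning on a small labeled set, template-based pseudo-labeling of unlabeled scenes, refinement, then retraining) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `VQA` record, the `train` and `refine` callables, and the function name `semi_supervised_rounds` are all hypothetical placeholders, and a real model would consume images rather than image identifiers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VQA:
    image_id: str
    question: str
    answer: str
    weight: float = 1.0  # assumed: 1.0 for human labels, a reliability score for pseudo-labels


# Hypothetical interfaces: a model maps (image_id, question) to an answer string.
Model = Callable[[str, str], str]
Trainer = Callable[[List[VQA]], Model]
Refiner = Callable[[Model, List[VQA]], List[VQA]]


def semi_supervised_rounds(
    labeled: List[VQA],
    unlabeled_ids: List[str],
    templates: List[str],
    train: Trainer,
    refine: Refiner,
    rounds: int = 2,
) -> Model:
    """Iteratively pseudo-label unlabeled scenes and retrain on the union."""
    # Step 1: supervised fine-tuning on the small labeled set.
    model = train(labeled)
    for _ in range(rounds):
        # Step 2: ask every template question about every unlabeled scene.
        pseudo = [
            VQA(image_id, question, model(image_id, question))
            for image_id in unlabeled_ids
            for question in templates
        ]
        # Step 3: refine pseudo-labels (e.g. attach reliability weights).
        pseudo = refine(model, pseudo)
        # Step 4: retrain on labeled plus refined pseudo-labeled data.
        model = train(labeled + pseudo)
    return model
```

Passing `train` and `refine` as callables keeps the sketch independent of any particular VisionLLM or training library; in practice `train` would wrap a fine-tuning run and `refine` would wrap the Self-Consistency Refinement step.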

Paper Structure

This paper contains 16 sections, 4 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: (a): The proposed semi-supervised Language Driving Model (LDM) training pipeline. We first fine-tune an LDM in a supervised manner with a small amount of labeled data. Then, given the fine-tuned LDM and unlabeled data, we generate VQAs via template-based prompts and refine them with the proposed Self-Consistency Refinement (SCR). The refined VQAs are used for subsequent model training. (b): Preliminary experiment on the DriveLM benchmark: running inference with the fine-tuned LDM given hints yields better performance.
  • Figure 2: Overall framework of Language Driving Model. Given multi-view images, we feed them into a shared InternViT-6B. Then, the language tokens from the prompt and the projected vision tokens are fed into the InternLM2-20B model to generate the responses.
  • Figure 3: Overall iterative training pipeline. Given an unlabeled image, we use the fine-tuned LDM $\mathcal{M}_t$ to generate multiple VQAs and build a scene graph from the predictions. Then, by extracting graph-based hints, each question can be re-asked. The distance between the two predictions gives a reliability score for each prediction, which is used as the sample-wise balance weight when training the subsequent model $\mathcal{M}_{t+1}$ in the next iteration.
  • Figure 4: Visualization of the generated QA pairs on the unlabeled images, including perception/prediction questions that extract node attributes and planning questions for the edges. Each VQA pair carries a score obtained via the proposed SCR.
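The scoring step in the Figure 3 caption (compare the original answer with the answer obtained after re-asking with hints, then use the agreement as a per-sample training weight) can be illustrated with a small sketch. The caption does not specify the distance used, so token-level Jaccard similarity stands in here purely as an assumption; the function names `token_jaccard`, `reliability_score`, and `weighted_mean_loss` are likewise hypothetical.

```python
from typing import Sequence


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens (assumed metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty answers agree trivially
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def reliability_score(first_answer: str, hinted_answer: str) -> float:
    """Score in [0, 1]: higher when the two predictions are closer."""
    return token_jaccard(first_answer, hinted_answer)


def weighted_mean_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Average per-sample losses, each scaled by its reliability weight."""
    total = sum(weights)
    if total == 0:
        return 0.0  # no reliable samples in this batch
    return sum(l * w for l, w in zip(losses, weights)) / total


# Usage: a consistent pseudo-label keeps full weight, an inconsistent one is dropped.
w_consistent = reliability_score("Turn left", "turn left")
w_inconsistent = reliability_score("stop", "go")
batch_loss = weighted_mean_loss([2.0, 4.0], [w_consistent, w_inconsistent])
```

Normalizing by the sum of weights keeps the loss scale comparable across batches with different proportions of reliable pseudo-labels; whether the paper normalizes this way is not stated in the captions.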