Reducing Text Bias in Synthetically Generated MCQAs for VLMs in Autonomous Driving

Sutej Kulgod, Sean Ye, Sanchit Tanwar, Christoffer Heckman

TL;DR

Synthetically generated MCQAs for Vision-Language Models (VLMs) in autonomous driving can embed hidden textual cues that let models answer correctly without visual grounding. The authors propose a debiasing pipeline with two parts: a two-stage MCQA generation process that decouples distractor sampling from the ground-truth answer by sampling distractors in maneuver-label space, and a curriculum-based option-dropping strategy that forces visual grounding. They detect textual bias with video-disabled evaluations, both zero-shot and after supervised fine-tuning (SFT). Results show that the debiased dataset $D_{new}$ reduces textual shortcuts: video-disabled accuracy approaches random chance, and curriculum-based training improves robustness, especially when the vision encoder is fully fine-tuned. This work improves benchmark validity for safety-critical VLMs by ensuring that model performance reflects perceptual understanding rather than language priors, and it can guide future data-generation and evaluation practices in autonomous driving.

Abstract

Multiple Choice Question Answering (MCQA) benchmarks are an established standard for measuring Vision Language Model (VLM) performance in driving tasks. However, we observe the known phenomenon that synthetically generated MCQAs are highly susceptible to hidden textual cues that allow models to exploit linguistic patterns rather than visual context. Our results show that a VLM fine-tuned on such data can achieve accuracy comparable to human-validated benchmarks even without visual input. Our proposed method reduces blind accuracy from +66.9% above random to +2.9%, eliminating the vast majority of exploitable textual shortcuts. By decoupling the correct answer from linguistic artifacts and employing a curriculum learning strategy, we force the model to rely on visual grounding, ensuring that performance accurately reflects perceptual understanding.
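
The "above random" figures quoted in the abstract can be read as video-disabled accuracy minus the accuracy expected from uniform guessing. The following is a minimal sketch of that calculation, not the authors' evaluation code. It assumes the margin is expressed in percentage points and that every question has the same number of options; the function name `blind_margin` and its parameters are hypothetical.

```python
from typing import Sequence


def blind_margin(predictions: Sequence[str],
                 answers: Sequence[str],
                 num_options: int) -> float:
    """Accuracy above uniform random chance, in percentage points.

    `predictions` are the options chosen by the model with video input
    disabled; `answers` are the ground-truth options. Both names are
    illustrative and do not come from the paper.
    """
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and aligned")
    if num_options < 2:
        raise ValueError("an MCQA needs at least two options")

    correct = sum(p == a for p, a in zip(predictions, answers))
    accuracy = 100.0 * correct / len(answers)
    chance = 100.0 / num_options  # assumed baseline: uniform guessing
    return accuracy - chance
```

A margin near zero suggests the text alone carries little exploitable signal, while a large positive margin indicates that the question and options leak the answer.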

Paper Structure (22 sections, 1 equation, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Top: We employ two video-disabled evaluations to detect textual cues. The Zero-Shot Test reveals inherent linguistic patterns in off-the-shelf models, while the SFT Test exposes the full magnitude of shortcut learning when a model is trained on biased data. Bottom: Our method replaces LLM-generated distractors with real descriptions sampled from elsewhere in the dataset. This reduces bias exploitation in both Zero-Shot and SFT.
  • Figure 2: Distribution of the correct option in the train and test subsets of $D_{llm}$ and $D_{new}$. Correct answers are evenly distributed across options in all datasets.
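
The distractor replacement described in the Figure 1 caption, combined with the maneuver-label sampling mentioned in the TL;DR, could be sketched roughly as follows. This is one possible reading under stated assumptions, not the authors' pipeline: the data layout (a mapping from clip id to a maneuver label and a description), the function name `build_mcqa`, and the choice of one distractor per distinct label are all hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def build_mcqa(sample_id: str,
               dataset: Dict[str, Tuple[str, str]],
               num_options: int = 4,
               rng: Optional[random.Random] = None) -> Tuple[List[str], int]:
    """Build options for one clip using real descriptions as distractors.

    `dataset` maps clip id -> (maneuver_label, description). This layout
    is an assumption made for illustration only.
    Returns (options, index_of_correct_option).
    """
    rng = rng or random.Random()
    true_label, true_desc = dataset[sample_id]

    # Group descriptions of other clips by label, skipping the true label
    # so that no distractor describes the same maneuver as the answer.
    by_label: Dict[str, List[str]] = {}
    for other_id, (label, desc) in dataset.items():
        if other_id != sample_id and label != true_label:
            by_label.setdefault(label, []).append(desc)

    if len(by_label) < num_options - 1:
        raise ValueError("not enough distinct maneuver labels for distractors")

    # Sample distractor labels first, independent of the answer's wording.
    distractor_labels = rng.sample(sorted(by_label), num_options - 1)
    # Then pick one real description from the dataset for each label.
    options = [rng.choice(by_label[label]) for label in distractor_labels]

    # Insert the correct description at a uniformly random position.
    correct_index = rng.randrange(num_options)
    options.insert(correct_index, true_desc)
    return options, correct_index
```

Because every option in this sketch is a real description drawn from the same dataset, the correct answer would not differ stylistically from the distractors, and the uniformly random insertion position is one simple way to obtain the even answer distribution reported in Figure 2.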