Tinted Frames: Question Framing Blinds Vision-Language Models

Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta

Abstract

Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind: they modulate the amount of attention applied to visual inputs based on linguistic framing, even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context than open-ended framing, reduce focus on task-relevant regions, and shift attention toward uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.

Paper Structure (56 sections, 1 equation, 14 figures, 6 tables)

Figures (14)

  • Figure 1: VLM grounding changes as a function of question framing. Attention maps reveal that while the model actively attends to the target object (the chair) during open-ended generation, it exhibits disengagement and misallocation when the same question is posed as a Yes/No or MCQ task. Note that we employ attention rollout rather than averaging attention weights across layers and tokens. By recursively rolling out attention matrices, we trace information propagation from the inputs of early layers to the output embeddings. Qwen2.5-VL-7B is used for the visualization. The top 3 tokens with the highest attention are highlighted in red boxes, and the minimum and maximum values of the linear colormap are held fixed across all images.
  • Figure 2: Illustration of the impact of question framing. We hypothesize that question framing influences model predictions through visual attention. Framing alters attention allocation (F$\rightarrow$A), which in turn degrades prediction quality (A$\rightarrow$Y).
  • Figure 3: Tested open-source VLMs exhibit significant inconsistency rates across framings and task types. (Left) Cross-framing inconsistency is evaluated by reframing questions with an LLM. (Right) With Qwen2.5-VL-7B, inconsistency is up to 26% on GQA and 38% on SeedBench.
  • Figure 4: Visual energy drops significantly on non-open-ended framings. (Top) There is a significant drop in attention applied to the portion of the image containing the object of interest, and a corresponding increase in attention to sink tokens, for yes/no and MCQ framing. (Bottom) Per-layer values of visual energy and box attention of Qwen2.5-VL-7B on GQA$^\text{F}$.
  • Figure 5: (Top) Illustration of question/instruction variation. (Bottom) Coefficient of quartile variation on VE and Box for varying framing and instruction.
  • ...and 9 more figures
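
The attention rollout mentioned in the Figure 1 caption can be sketched as follows. This is a minimal illustration of the standard rollout recipe (average over heads, mix in the identity to account for residual connections, renormalize rows, then multiply layer by layer), not necessarily the exact variant used in the paper. The function name `attention_rollout`, the `residual_weight` parameter, and the random toy attention tensors are assumptions made for illustration only.

```python
import numpy as np

def attention_rollout(attentions, residual_weight=0.5):
    """Roll out per-layer attention into a single token-to-token relevance map.

    attentions: sequence of arrays shaped (heads, tokens, tokens), ordered
    from the first layer to the last. residual_weight is an assumed mixing
    factor for the identity matrix that stands in for the residual path.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        head_avg = layer_attn.mean(axis=0)  # average over heads
        mixed = (1.0 - residual_weight) * head_avg + residual_weight * identity
        mixed = mixed / mixed.sum(axis=-1, keepdims=True)  # keep rows stochastic
        rollout = mixed @ rollout  # compose with all earlier layers
    return rollout

# Toy example with random softmax attention: 3 layers, 4 heads, 6 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 6, 6))
attn = np.exp(logits)
attn = attn / attn.sum(axis=-1, keepdims=True)

rollout = attention_rollout(list(attn))
# Row i describes how much each input token contributes to output token i.
last_token_relevance = rollout[-1]
```

In a VLM setting, the row for the generated answer token would be restricted to the image-token columns and reshaped to the patch grid to produce a map like those in Figure 1; that indexing depends on the model's tokenization and is left out here.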