Can We Talk Models Into Seeing the World Differently?

Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, M. Jehanzeb Mirza, Margret Keuper, Janis Keuper

TL;DR

The paper investigates the texture vs. shape bias in vision-language models (VLMs) and whether language prompts can steer which visual cues a model uses. Using a large-scale evaluation across VLMs on cue-conflict datasets for VQA and image captioning, the authors quantify the bias and compare it with that of vision encoders and humans. They show that VLMs partially inherit biases from their vision encoders, but that multi-modal fusion and LLM processing modulate cue reliance, which enables steering via prompts with minimal accuracy loss. Steering extends to low/high-frequency cues through prompt design and automated search, highlighting implications for alignment and controllability in vision-language systems.

Abstract

Unlike traditional vision-only models, vision language models (VLMs) offer an intuitive way to access visual content through language prompting by combining a large language model (LLM) with a vision encoder. However, both the LLM and the vision encoder come with their own set of biases, cue preferences, and shortcuts, which have been rigorously studied in uni-modal models. A timely question is how such (potentially misaligned) biases and cue preferences behave under multi-modal fusion in VLMs. As a first step towards a better understanding, we investigate a particularly well-studied vision-only bias - the texture vs. shape bias and the dominance of local over global information. As expected, we find that VLMs inherit this bias to some extent from their vision encoders. Surprisingly, the multi-modality alone proves to have important effects on the model behavior, i.e., the joint training and the language querying change the way visual cues are processed. While this direct impact of language-informed training on a model's visual perception is intriguing, it raises further questions on our ability to actively steer a model's output so that its prediction is based on particular visual cues of the user's choice. Interestingly, VLMs have an inherent tendency to recognize objects based on shape information, which is different from what a plain vision encoder would do. Further active steering towards shape-based classifications through language prompts is however limited. In contrast, active VLM steering towards texture-based decisions through simple natural language prompts is often more successful. URL: https://github.com/paulgavrikov/vlm_shapebias

Paper Structure (45 sections, 2 equations, 13 figures, 10 tables)

This paper contains 45 sections, 2 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Language can be used to steer visual cue preferences (biases) in vision language models (VLMs). Here we illustrate the (visual) texture/shape bias (Geirhos et al., 2018) of some exemplary VLMs, and highlight the steerability of InternVL-Chat 1.1 (Chen et al., 2024) through the processing of vision and text inputs (prompts).
  • Figure 2: Most VLMs prioritize shapes over texture cues. We measure the shape bias on the cue-conflict dataset (Geirhos et al., 2018). For reference, we also provide measurements on an ImageNet-trained ResNet-50 (He et al., 2016), zero-shot classification with CLIP ViT-L/14 (Radford et al., 2021), and a human average (over 10 subjects; Geirhos et al., 2018). The results in table format are shown in the supplementary section on detailed results.
  • Figure 3: Confidence distribution of shape and texture tokens for all samples. All models form highly biased decisions by completely ignoring one cue. Measured on LLaVA-NeXT 7B, InternVL-Chat 1.1, and MoE-LLaVA-Phi2 for the VQA task.
  • Figure 4: Language can steer the texture/shape bias to some extent. We test the same texture/shape-biased instructions on multiple models, showing that these can already shift some decisions (usually in favor of texture). The stated percentages refer to the achieved accuracy on cue-conflict. For InternVL 1.1 and LLaVA-NeXT 7B we additionally test the understanding of texture/shape by using synonyms. Furthermore, we use an LLM to automatically search for specific prompts to optimize in either direction.
  • Figure 5: Detailed shape bias measurements under synonyms for biased VQA prompts.
  • ...and 8 more figures
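
The shape bias and cue-conflict accuracy that the captions above refer to can be illustrated with a short sketch. It assumes the common definition from Geirhos et al. (2018): shape bias is the share of shape decisions among all decisions that match either the shape or the texture label, and accuracy counts a prediction as correct if it matches either label. The function name, the record layout, and the example labels are hypothetical and are not taken from the paper's code.

```python
from typing import Iterable, Tuple


def shape_bias_and_accuracy(
    records: Iterable[Tuple[str, str, str]],
) -> Tuple[float, float]:
    """Return (shape_bias, cue_accuracy) for cue-conflict predictions.

    Each record is (predicted_label, shape_label, texture_label).
    Assumed definitions (following Geirhos et al., 2018):
      shape_bias   = shape_hits / (shape_hits + texture_hits)
      cue_accuracy = (shape_hits + texture_hits) / number_of_conflict_samples
    """
    shape_hits = 0
    texture_hits = 0
    total = 0
    for predicted, shape, texture in records:
        if shape == texture:
            # Not a real cue conflict; such samples are skipped.
            continue
        total += 1
        if predicted == shape:
            shape_hits += 1
        elif predicted == texture:
            texture_hits += 1
        # Any other prediction counts as wrong for both cues.

    if total == 0:
        raise ValueError("no cue-conflict samples provided")

    cue_hits = shape_hits + texture_hits
    # If the model never picks either cue, the bias is undefined.
    shape_bias = shape_hits / cue_hits if cue_hits else float("nan")
    cue_accuracy = cue_hits / total
    return shape_bias, cue_accuracy


# Hypothetical predictions on four cue-conflict images.
example = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("cat", "cat", "dog"),            # shape decision
    ("car", "bottle", "clock"),       # matches neither cue
]
bias, accuracy = shape_bias_and_accuracy(example)
print(f"shape bias: {bias:.3f}, cue-conflict accuracy: {accuracy:.2f}")
```

Under these assumptions, a shape bias of 1.0 means every cue-matching decision followed shape and 0.0 means every one followed texture, so a prompt that steers the model shifts this ratio while the accuracy indicates how many answers still match one of the two cues.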