The in-context inductive biases of vision-language models differ across modalities

Kelsey Allen, Ishita Dasgupta, Eliza Kosoy, Andrew K. Lampinen

TL;DR

The paper investigates how vision-language models (VLMs) generalize in context and how the input modality (vision vs. text) influences inductive biases such as shape vs. color. It adapts three category-learning paradigms from cognitive science to three VLMs and measures how biases shift between image and text presentations, including adjective-order effects in text. The models generally show a shape bias that tends to be stronger with visual input, and a text-order effect in which generalization shifts toward the first-mentioned feature; the size of both effects varies by model and task. These results illuminate how VLMs represent inputs from different modalities in context and have practical implications for prompting and data presentation.

Abstract

Inductive biases are what allow learners to make guesses in the absence of conclusive evidence. These biases have often been studied in cognitive science using concepts or categories -- e.g. by testing how humans generalize a new category from a few examples that leave the category boundary ambiguous. We use these approaches to study generalization in foundation models during in-context learning. Modern foundation models can condition on both vision and text, and differences in how they interpret and learn from these different modalities are an emerging area of study. Here, we study how their generalizations vary by the modality in which stimuli are presented, and the way the stimuli are described in text. We study these biases with three different experimental paradigms, across three different vision-language models. We find that the models generally show some bias towards generalizing according to shape over color. This shape bias tends to be amplified when the examples are presented visually. By contrast, when examples are presented in text, the ordering of adjectives affects generalization. However, the extent of these effects varies across models and paradigms. These results help to reveal how vision-language models represent different types of inputs in context, and may have practical implications for the use of vision-language models.

Paper Structure

This paper contains 12 sections, 6 figures.

Figures (6)

  • Figure 1: Conceptual overview of our experimental paradigms.
  • Figure 2: Across three category-generalization paradigms, VLMs generalize more by shapes than colors overall (bars above the midline). This bias tends to be amplified when the categories are presented as images (blue), compared to when they are presented in text (orange). However, this pattern is flipped for the odd-one-out task for some models. (Error bars are bootstrapped 95% CIs.)
  • Figure 3: The order in which the features are presented in text shifts the VLMs' generalization biases; models show some generalization preference toward the feature that is mentioned first (green bars are higher than yellow bars). For one-category tasks, Claude & GPT refuse all generalizations; hence they show no text-order differences. (Error bars are bootstrapped 95% CIs.)
  • Figure 4: Example stimuli used in our experiment, showing some of the range of colors, shapes, and viewpoints presented in the dataset. The variation in viewpoints is intended to ensure that the model is truly recognizing 3D shape rather than relying on canonical orientations.
  • Figure 5: Patterns of generalization of a single category presented with varying stimulus sets. There are noticeable changes across stimulus sets, and across modalities, but the patterns are not particularly consistent.
  • ...and 1 more figure
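
The captions for Figures 2 and 3 report bootstrapped 95% CIs around the models' generalization preferences. As a rough illustration of what such a statistic might look like, the sketch below computes a shape-bias score and a percentile bootstrap interval. It assumes each trial is coded as 1 when the model generalized by shape and 0 when it generalized by color; that coding, the function names `shape_bias` and `bootstrap_ci`, and the resampling settings are all hypothetical and are not taken from the paper.

```python
import random


def shape_bias(choices):
    """Fraction of trials coded 1 (shape) rather than 0 (color).

    The 0/1 coding is an assumption made for this sketch.
    """
    return sum(choices) / len(choices)


def bootstrap_ci(choices, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the shape-bias score.

    Resamples trials with replacement, recomputes the score each time,
    and returns the alpha/2 and 1 - alpha/2 percentiles.
    """
    rng = random.Random(seed)
    n = len(choices)
    stats = sorted(
        shape_bias([choices[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up illustrative data: 70 shape choices, 30 color choices.
    trials = [1] * 70 + [0] * 30
    print("shape bias:", shape_bias(trials))
    print("95% CI:", bootstrap_ci(trials))
```

A score above 0.5 would correspond to bars above the midline in Figure 2. Comparing intervals computed separately for image and text trials, or for shape-first and color-first adjective orders, is one way the modality and text-order contrasts could be examined under these assumptions.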