Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts

Aditya Sharma, Michael Saxon, William Yang Wang

TL;DR

LoCoVQA introduces a dynamic benchmark for long-context visual reasoning, systematically testing extractive attention by surrounding a content image with distractor images in both composed and interleaved formats. Across OK-VQA, MMStar, and MNIST-derived tasks, a robust logarithmic decay in VLM performance emerges as visual context grows, revealing a fundamental difficulty in ignoring irrelevant images. The study uncovers positional biases in interleaved inputs and demonstrates that even top proprietary models struggle with basic extraction tasks in long contexts, highlighting gaps in training objectives that do not emphasize cross-image attention. By generalizing to any VQA dataset, LoCoVQA offers a practical, scalable path to diagnose and improve long-context multimodal reasoning in current and future vision-language models.
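
As a rough illustration of the two input formats mentioned above, the sketch below hides one content item among sampled distractors, either as a flat sequence (interleaved) or arranged in a square grid (composed). The function names `build_context` and `to_grid`, the uniform random placement, and the `None` padding of incomplete grids are assumptions made for this example, not details taken from the paper.

```python
import math
import random


def build_context(content, distractor_pool, context_len, rng):
    """Hide `content` among `context_len - 1` distractors (hypothetical helper).

    Returns the interleaved sequence and the index of the content item.
    Uniform random placement is an assumption of this sketch.
    """
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    if context_len - 1 > len(distractor_pool):
        raise ValueError("not enough distractors in the pool")
    distractors = rng.sample(distractor_pool, context_len - 1)
    position = rng.randrange(context_len)
    sequence = distractors[:position] + [content] + distractors[position:]
    return sequence, position


def to_grid(sequence):
    """Arrange a sequence row by row into the smallest square grid that fits it.

    Unused cells are padded with None (an assumption of this sketch).
    """
    side = math.isqrt(len(sequence) - 1) + 1  # ceil(sqrt(n)) for n >= 1
    padded = list(sequence) + [None] * (side * side - len(sequence))
    return [padded[r * side:(r + 1) * side] for r in range(side)]


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [f"distractor_{i}" for i in range(20)]  # placeholder image IDs
    seq, pos = build_context("content_image", pool, 9, rng)
    for row in to_grid(seq):
        print(row)
    print("content index:", pos)
```

Strings stand in for images here; in practice each cell would hold an image that is either pasted into one composite picture (composed) or passed to the model as a separate input (interleaved).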

Abstract

We present LoCoVQA, a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking logarithmic decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries -- a task that is quite easy for language models (LMs) in the text domain -- demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
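
The "logarithmic decay trend" can be made concrete with a small fitting sketch. It assumes the trend has the form accuracy ≈ a + b·ln(n), where n is the number of images in the context; that functional form, the helper name `fit_log_trend`, and the numbers in the demo are illustrative assumptions, not results or code from the paper.

```python
import math


def fit_log_trend(context_lengths, accuracies):
    """Least-squares fit of accuracy = a + b * ln(context_length).

    Returns (a, b). A negative b indicates accuracy falling as context grows.
    """
    xs = [math.log(n) for n in context_lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


if __name__ == "__main__":
    # Synthetic, noise-free data generated from a known curve (NOT paper results).
    lengths = [1, 4, 9, 16, 25, 36]
    synthetic_acc = [0.8 - 0.1 * math.log(n) for n in lengths]
    a, b = fit_log_trend(lengths, synthetic_acc)
    print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

Under this form, each doubling of the context length changes accuracy by the fixed amount b·ln 2, which is what separates a logarithmic decay from a linear one.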

Paper Structure (28 sections, 9 figures, 4 tables)

Figures (9)

  • Figure 2: Example of an image ($X$) corresponding to a question-answer pair ($Q$, $A$) under increasing visual context lengths in the composed and interleaved settings. The green box is for illustration only and is not included in model inputs.
  • Figure 3: Each subfigure represents a variable number of hidden MNIST digits in a 9-image composed context.
  • Figure 4: VLM performance on MMStar and OK-VQA. Note the clearly declining logarithmic fit trends for many of the models. The (model, task) pairs for which these trends do not hold are, by and large, below the random baseline.
  • Figure 5: Radar plots of VLM performance across 8 multimodal benchmarks with varied visual context lengths.
  • Figure 6: VLM performance on the MNIST-Digits transcription task as a function of the number of digits to transcribe. These plots have a different x-axis than the plots in Figure 1 and Figure 4: rather than the relationship between context size and performance, we are assessing the relationship between "task difficulty" and performance at four context sizes.
  • ...and 4 more figures