Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
Aditya Sharma, Michael Saxon, William Yang Wang
TL;DR
LoCoVQA is a dynamic benchmark for long-context visual reasoning. It systematically tests extractive attention by surrounding a content image with distractor images in both composed and interleaved formats. Across OK-VQA, MMStar, and MNIST-derived tasks, VLM performance shows a robust logarithmic decay as the visual context grows, revealing a fundamental difficulty in ignoring irrelevant images. The study also uncovers positional biases in interleaved inputs and shows that even top proprietary models struggle with basic extraction tasks in long contexts, which points to training objectives that do not emphasize cross-image attention. Because it can be applied to any VQA dataset, LoCoVQA offers a practical, scalable way to diagnose and improve long-context multimodal reasoning in current and future vision-language models.
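To make the "logarithmic decay" claim concrete, the sketch below fits accuracy = a + b * ln(n) to a set of (context length, accuracy) pairs by ordinary least squares. This is only one plausible way to check for such a trend, not the authors' analysis code. The function name `fit_log_trend`, the context lengths, and the accuracy values are all hypothetical; the accuracies are generated from a made-up curve and are not results from the paper.

```python
import math

def fit_log_trend(context_lengths, accuracies):
    """Least-squares fit of accuracy = a + b * ln(n); returns (a, b)."""
    xs = [math.log(n) for n in context_lengths]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(accuracies) / len(accuracies)
    # Standard closed-form slope and intercept for simple linear regression.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

# Illustrative context lengths (number of images shown to the model).
lengths = [1, 4, 9, 16, 25, 36]
# Synthetic accuracies from an assumed curve, NOT measurements from the paper.
synthetic = [0.8 - 0.15 * math.log(n) for n in lengths]

a, b = fit_log_trend(lengths, synthetic)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

A negative slope `b` means accuracy falls by a fixed amount each time the context length is multiplied by a constant factor, which is what a logarithmic decay trend implies.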
Abstract
We present LoCoVQA, a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking logarithmic decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries -- a task that is quite easy for language models (LMs) in the text domain -- demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
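The augmentation described above can be sketched as follows: one content image is placed at a random position among sampled distractors, and the resulting list is either used as an interleaved sequence or laid out as a grid. This is a minimal sketch under stated assumptions, not the LoCoVQA implementation. Images are represented by placeholder strings, the near-square grid layout is an assumed reading of the "composed" format, and the names `build_context` and `compose_grid` are hypothetical.

```python
import math
import random

def build_context(content, distractor_pool, context_len, seed=0):
    """Hide one content image among distractors.

    Returns (images, content_index), where images has context_len entries.
    """
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    if context_len - 1 > len(distractor_pool):
        raise ValueError("not enough distractors in the pool")
    rng = random.Random(seed)  # seeded so a generated example is reproducible
    images = rng.sample(distractor_pool, context_len - 1)
    position = rng.randrange(context_len)
    images.insert(position, content)
    return images, position

def compose_grid(images, blank=None):
    """Arrange images row by row in a near-square grid (assumed composed layout)."""
    cols = math.ceil(math.sqrt(len(images)))
    rows = math.ceil(len(images) / cols)
    # Pad the last row with blanks so every row has the same width.
    padded = images + [blank] * (rows * cols - len(images))
    return [padded[r * cols:(r + 1) * cols] for r in range(rows)]

# Placeholder identifiers stand in for real image data.
pool = [f"distractor_{i}" for i in range(20)]
images, idx = build_context("content", pool, context_len=9, seed=1)

interleaved = images          # pass images one after another with the query
grid = compose_grid(images)   # or tile them into a single composite layout
```

Growing `context_len` while keeping the query fixed yields the increasingly long visual contexts the benchmark evaluates, and recording `idx` makes it possible to look for position effects in the interleaved setting.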
