Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan

TL;DR

This paper introduces Visual Haystacks (VHs), a vision-centric long-context benchmark that challenges large multimodal models to retrieve and reason across thousands of images. It documents limitations of existing long-context LMMs in MIQA and proposes MIRAGE, an open-source visual-RAG framework that compresses image tokens, uses a query-aware retriever, and is trained on a large MIQA instruction dataset to scale to 10k images. MIRAGE achieves up to 13% improvement over open-source baselines on VHs and sets a new state-of-the-art on RetVQA while remaining competitive on single-image QA, highlighting both scalability and efficiency gains. The work provides the dataset, model, and code openly, offering a practical path toward real-world MIQA applications such as photo-collection search and geospatial imagery analysis.

Abstract

Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advancements like long-context LMMs have allowed them to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like photo album searches or satellite imagery analysis. In this work, we first assess the limitations of current benchmarks for long-context LMMs. We address these limitations by introducing a new vision-centric, long-context benchmark, "Visual Haystacks (VHs)". We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, and exhibit biases based on the placement of key information within the context window. Towards a solution, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU, far surpassing the 1k-image limit of contemporary models. MIRAGE demonstrates up to 13% performance improvement over existing open-source LMMs on VHs, sets a new state-of-the-art on the RetVQA multi-image QA benchmark, and achieves performance competitive with state-of-the-art LMMs on single-image QA. Our dataset, model, and code are available at: https://visual-haystacks.github.io.

Paper Structure

This paper contains 21 sections, 1 equation, 15 figures, 3 tables.

Figures (15)

  • Figure 1: (A) Unlike existing visual Needle-In-A-Haystack (NIAH) challenges (Reid et al., 2024) that overlay needle information as text onto an image, our "Visual Haystacks" (VHs) benchmark is vision-centric, requiring the model to first retrieve the needle image(s) from the haystack and then reason about the image(s) to answer the question. (B) We benchmark existing LMMs under different NIAH settings where only one needle image is present among ten images. Traditional visual NIAH challenges overemphasize text retrieval, which state-of-the-art models with strong OCR capabilities can easily exploit, yet these same models are unable to solve the simple visual questions in VHs.
  • Figure 2: Experimental results on the VHs single-needle challenge. All LMMs experience significant falloff as the size of the haystack (N) increases, indicating that existing approaches are not robust to complex visual-linguistic processing over long visual contexts. Note the non-linear x-axis in this plot.
  • Figure 3: Experimental results on the VHs multi-needle challenge. (A) The oracle experiment, which uses only needle images as input, demonstrates significant performance degradation in both proprietary and open-source LMMs when required to integrate information across multiple images. (B) In the full multi-needle challenge that includes distractor images, the performance of existing LMMs declines as the size of the haystack (N) increases. Given the same haystack size, performance deteriorates considerably compared to the single-needle challenge across all models in most scenarios. These findings indicate that current methodologies struggle with real-world, large-scale multi-image QA tasks that demand both visual retrieval and reasoning across extensive visual contexts.
  • Figure 4: Needle position vs. performance on the VHs benchmark for several image settings. For existing LMMs, the needle position is extremely important, with performance degradation of up to 25% when the needle is not placed in the optimal location in the input context. The gray boxes mark experiments that exceed the available context length or that the model is unable to execute on 4 A100 GPUs.
  • Figure 5: MIRAGE enables large-scale long-context visual retrieval and reasoning through a combination of key components and a large-scale training set. (A) MIRAGE processes questions and images in several stages: first, image features are encoded using CLIP, followed by compression through our Q-Former. A retriever module then calculates relevance scores, ensuring that only the most relevant images are passed to the LLM for final reasoning. The red dashed line illustrates the key difference between conventional LMMs, like LLaVA, and our visual-RAG approach. (B) To further improve MIRAGE's performance in long-context retrieval and reasoning, we introduce a large-scale MIQA instruction-tuning dataset, blending both synthetic and real-world data.
  • ...and 10 more figures
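
The retrieval step described in the Figure 5 caption (score each image for relevance, then pass only the most relevant ones to the LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ScoredImage` and `select_relevant`, the `threshold` and `max_images` parameters, and the example scores are all hypothetical. In MIRAGE the scores would come from the retriever module operating on compressed image features; here they are simply given as a list of floats.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ScoredImage:
    index: int        # position of the image in the original haystack
    relevance: float  # hypothetical retriever score for this image


def select_relevant(
    scores: Sequence[float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
    max_images: int = 4,     # assumed cap on images forwarded to the LLM
) -> List[ScoredImage]:
    """Keep images whose score passes the threshold, capped at max_images.

    Survivors are returned in their original haystack order.
    """
    candidates = [
        ScoredImage(i, s) for i, s in enumerate(scores) if s >= threshold
    ]
    # Rank by relevance (highest first) to apply the cap ...
    top = sorted(candidates, key=lambda c: c.relevance, reverse=True)[:max_images]
    # ... then restore the original input order.
    return sorted(top, key=lambda c: c.index)


if __name__ == "__main__":
    # Illustrative scores for a six-image haystack (made-up values).
    example_scores = [0.1, 0.9, 0.2, 0.7, 0.95, 0.6]
    kept = select_relevant(example_scores, threshold=0.5, max_images=2)
    print([c.index for c in kept])
```

Filtering before the LLM sees any image is what lets the context stay small regardless of haystack size: the number of images forwarded is bounded by `max_images` in this sketch, no matter how many images were scored.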