Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Hengyi Wang; Haizhou Shi; Shiwei Tan; Weiyi Qin; Wenyuan Wang; Tunyu Zhang; Akshay Nambi; Tanuja Ganu; Hao Wang

TL;DR

The paper addresses the gap in evaluating the long-context understanding of multimodal LLMs. It introduces MMNeedle, a benchmark that builds long visual contexts from multiple stitched images (the haystack) and asks a model to locate a target sub-image (the needle) from a textual caption, thereby stress-testing long-context visual comprehension and information retrieval. It provides a protocol that automatically generates labels for sub-image retrieval and reports that GPT-4o consistently outperforms other models in long-context settings while still hallucinating on negative samples, where the needle is absent from the haystack. The evaluation also reveals a considerable performance gap between API-based and open-source models. Code, data, and instructions for reproducing the main results are available at the GitHub repository given in the abstract.

Abstract

Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address this gap, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios, but suffers from hallucination problems in negative samples, i.e., when needles are not in the haystacks. Our comprehensive long-context evaluation of MLLMs also sheds light on the considerable performance gap between API-based and open-source models. All the code, data, and instructions required to reproduce the main results are available at https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack.
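To make the automatic labeling idea concrete, here is a minimal sketch of how a ground-truth label could be derived once sub-images are arranged into stitched grids. It is an illustration under stated assumptions, not the benchmark's actual code: the function names `build_haystack` and `locate`, the row-major layout, and the 1-based (image index, row, column) convention are all hypothetical, and sub-images are represented by integer ids rather than pixels.

```python
from typing import List, Optional, Tuple

# Assumed label convention (not taken from the paper): 1-based
# (image index, row, column) of the needle inside the haystack.
Location = Tuple[int, int, int]
Grid = List[List[int]]


def build_haystack(sub_image_ids: List[int], m: int, n: int) -> List[Grid]:
    """Arrange m * n * n sub-image ids into m grids of n x n, row-major.

    Each grid stands in for one stitched image; real stitching would
    paste the corresponding pixels at the same grid positions.
    """
    if len(sub_image_ids) != m * n * n:
        raise ValueError("expected exactly m * n * n sub-image ids")
    ids = iter(sub_image_ids)
    return [[[next(ids) for _ in range(n)] for _ in range(n)] for _ in range(m)]


def locate(haystack: List[Grid], needle_id: int) -> Optional[Location]:
    """Return the needle's label, or None for a negative sample."""
    for image_index, grid in enumerate(haystack, start=1):
        for row_index, row in enumerate(grid, start=1):
            for col_index, sub_id in enumerate(row, start=1):
                if sub_id == needle_id:
                    return (image_index, row_index, col_index)
    return None  # the needle is not in the haystack


# Two stitched images (M = 2), each a 2 x 2 grid (N = 2).
haystack = build_haystack(list(range(100, 108)), m=2, n=2)
label = locate(haystack, 106)
missing = locate(haystack, 999)
```

Because the label falls out of the grid position, no manual annotation is needed; a negative sample is simply a needle id that was never placed in any grid.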

Figures (1)

Figure 1: MMNeedle evaluation overview. Correct answers are marked with a checkmark ($\checkmark$), while incorrect answers are marked with a cross ($\bm{\times}$). Our evaluation setup involves the following key components: (a) Needle Sub-Image: The needle sub-image to be retrieved based on the given caption. (b) Haystack Image Inputs: The long-context visual inputs consist of $M$ images, each stitched from $N\times N$ sub-images. (c) Text Inputs (Instructions and Caption): Detailed instructions to MLLMs, followed by a caption describing the needle, i.e., sub-image $20$. See Sec. \ref{sec:instruction} for MMNeedle's complete instructions. (d) LLM Outputs: The answers from different MLLMs, indicating their ability to accurately locate the needle in the haystack based on the given caption. The expected output consists of the model's predicted image index, row, and column for the matching sub-image. The results showcase the comparative performance of various models: GPT-4o correctly predicts the exact location of the needle; Gemini Pro 1.5 correctly predicts only the image index of the needle; other API models predict incorrect locations; open-source models often produce outputs in the wrong format.
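The caption distinguishes three outcomes: an exact match on index, row, and column; a match on the image index only; and a reply in the wrong format. A minimal sketch of how such replies might be parsed and scored follows. The reply format (three comma-separated integers) and the names `parse_answer` and `score` are assumptions made for illustration; the benchmark's real instructions and scoring code may differ, and negative samples are not handled here.

```python
import re
from typing import Dict, Optional, Tuple

Location = Tuple[int, int, int]  # (image index, row, column)

# Assumed reply format: exactly three integers separated by commas.
_ANSWER = re.compile(r"\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*")


def parse_answer(text: str) -> Optional[Location]:
    """Parse a reply such as '2, 1, 3'; return None if the format is wrong."""
    match = _ANSWER.fullmatch(text)
    if match is None:
        return None
    index, row, col = (int(group) for group in match.groups())
    return (index, row, col)


def score(pred: Optional[Location], truth: Location) -> Dict[str, bool]:
    """Score one positive sample at two granularities."""
    return {
        # Right stitched image, regardless of position inside it.
        "index_correct": pred is not None and pred[0] == truth[0],
        # Right image, row, and column.
        "exact_correct": pred == truth,
    }


truth = (2, 2, 1)
exact = score(parse_answer("2, 2, 1"), truth)
index_only = score(parse_answer("2, 1, 3"), truth)
malformed = score(parse_answer("The needle is in image 2."), truth)
```

Treating a malformed reply as incorrect at both granularities mirrors the caption's observation that wrong-format outputs count against a model.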
