Exploring Perceptual Limitation of Multimodal Large Language Models

Jiarui Zhang, Jinyi Hu, Mahyar Khayatkhoei, Filip Ilievski, Maosong Sun

TL;DR

This work investigates why state-of-the-art multimodal large language models struggle to perceive small objects in visual inputs. It conducts a large-scale, controlled study across seven MLLMs on GQA and TextVQA, isolating four factors—object quality, size, distractors, and position—that affect perception of small objects. The results reveal universal size-related degradation and model-specific sensitivities to quality, distractors, and location, along with notable positional biases due to training data and patch-based processing. The authors propose a new evaluation protocol for perceptual robustness and release code and data to facilitate future improvements in reliable multimodal perception.

Abstract

Multimodal Large Language Models (MLLMs) have recently shown remarkable perceptual capability in answering visual questions; however, little is known about the limits of their perception. In particular, while prior works have provided anecdotal evidence of MLLMs' sensitivity to object size, this phenomenon and its underlying causes have not been explored comprehensively. In this work, we quantitatively study the perception of small visual objects in several state-of-the-art MLLMs and reveal a pervasive limitation in answering questions about small objects in images. Next, we identify four independent factors that can contribute to this limitation -- object quality, size, distractors, and location -- and conduct controlled intervention studies to measure the effect of each factor on MLLMs' perception. In particular, we find that lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions. More surprisingly, we find that the location of the object in the image and the presence of visual distractors can also significantly reduce MLLMs' question answering accuracy. Our study provides a better understanding of the perceptual limitation of MLLMs and contributes new evaluation protocols for analyzing the perception of future MLLMs. To facilitate further investigations, we release our code and data.

Paper Structure (17 sections, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Failure cases of GPT-4V (OpenAI, 2023) in perceiving small objects when serving as a web agent. Our research studies this perceptual limitation in several recent MLLMs.
  • Figure 2: The performance of multiple popular MLLMs on GQA and TextVQA shows a clear positive correlation with the relative size of the target objects. Accuracy is computed with inclusion match. *A small fraction of the questions is skipped due to the safety policies of API models. $^\dagger$The model has been reported to be trained on the dataset.
  • Figure 3: An illustration of the Downsample-Upsample (upper) and Crop-Upsample (lower) procedures described in the sections on object quality and object size. The upper process reduces object quality by a factor of 6 while keeping the same size and position. The lower process increases object size by a factor of 3 while keeping the object quality.
  • Figure 4: The effect of changing the text sampling rate (quality) on the models' text-reading performance while keeping the text size fixed. Notably, from a sampling rate of 8 (marked in red) onward, the text in the image becomes fully recognizable as '5934549'.
  • Figure 5: The effect of changing the text size on the models' text-reading performance while keeping the text sampling rate fixed.
  • ...and 8 more figures
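
The captions of Figures 2 and 3 name three operations: inclusion-match scoring, Downsample-Upsample, and Crop-Upsample. The sketch below is one possible reading of them, not the authors' released code. It assumes a grayscale image stored as a list of rows, nearest-neighbour resizing (the interpolation method is not stated here), and an inclusion match that counts a response as correct when any reference answer appears in it as a substring. All function names are hypothetical.

```python
from typing import List, Sequence

# Illustrative image type: rows of grayscale pixel values.
Image = List[List[int]]


def resize_nearest(img: Image, new_h: int, new_w: int) -> Image:
    """Nearest-neighbour resize (an assumption; the interpolation is not specified)."""
    h, w = len(img), len(img[0])
    return [
        [img[r * h // new_h][c * w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def downsample_upsample(img: Image, factor: int) -> Image:
    """Reduce quality by `factor` while keeping the original size and position."""
    h, w = len(img), len(img[0])
    small = resize_nearest(img, max(1, h // factor), max(1, w // factor))
    return resize_nearest(small, h, w)


def crop_upsample(img: Image, top: int, left: int,
                  height: int, width: int, factor: int) -> Image:
    """Crop a region and enlarge it by `factor`, so the object spans more pixels."""
    crop = [row[left:left + width] for row in img[top:top + height]]
    return resize_nearest(crop, height * factor, width * factor)


def inclusion_match(response: str, answers: Sequence[str]) -> bool:
    """Assumed scoring rule: correct if any reference answer occurs in the response."""
    text = response.lower()
    return any(answer.lower() in text for answer in answers)


if __name__ == "__main__":
    # A 6x6 toy image whose pixel value encodes its position.
    image = [[r * 6 + c for c in range(6)] for r in range(6)]
    degraded = downsample_upsample(image, factor=2)
    enlarged = crop_upsample(image, top=2, left=2, height=2, width=2, factor=3)
    print(len(degraded), len(degraded[0]), len(enlarged), len(enlarged[0]))
    print(inclusion_match("The number is 5934549.", ["5934549"]))
```

In this reading, `downsample_upsample` changes only how much detail survives, while `crop_upsample` changes only how many pixels the object covers, which is the separation of quality from size that the Figure 3 caption describes.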