
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

Ziyue Wang, Chi Chen, Fuwen Luo, Yurui Dong, Yuanchi Zhang, Yuzhuang Xu, Xiaolong Wang, Peng Li, Yang Liu

TL;DR

ActiView tackles the gap in evaluating active perception for Multimodal LLMs by introducing a VQA-style benchmark that constrains perceptual fields and requires shifting and zooming. It provides three pipelines (zooming, shifting, and mixed) to test core abilities and their integration, along with interleaved multi-image inputs to reflect realistic multimodal processing. Across 30 models, results reveal a substantial gap to human performance and show that multi-image models and autonomous view strategies improve active perception, though large models sometimes struggle with mixed-instruction scenarios. The benchmark and findings aim to drive development of MLLMs capable of natural, holistic multimodal understanding under dynamic perceptual constraints, with broader implications for real-world AI systems.

Abstract

Active perception, a crucial human capability, involves setting a goal based on the current understanding of the environment and performing actions to achieve that goal. Despite significant efforts in evaluating Multimodal Large Language Models (MLLMs), active perception has been largely overlooked. To address this gap, we propose a novel benchmark named ActiView to evaluate active perception in MLLMs. We focus on a specialized form of Visual Question Answering (VQA) that eases and quantifies the evaluation yet remains challenging for existing MLLMs. We also discuss the intermediate reasoning behaviors of models. Given an image, we restrict the perceptual field of a model, requiring it to actively zoom or shift its perceptual field based on reasoning to answer the question successfully. We conduct an extensive evaluation of over 30 models, including proprietary and open-source models, and observe that restricted perceptual fields play a significant role in enabling active perception. Results reveal a significant gap in the active perception capability of MLLMs, indicating that this area deserves more attention. We hope that ActiView can help develop methods for MLLMs to understand multimodal inputs in more natural and holistic ways.
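To make the idea of a restricted perceptual field concrete, the following is a minimal sketch and not the benchmark's actual implementation. It assumes the image is divided into a fixed grid of non-overlapping views. The grid size, the function names `split_into_views`, `shift`, and `zoom`, and the four direction names are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom) in pixels; an assumed box convention.
Box = Tuple[int, int, int, int]

# Hypothetical direction names mapped to (row delta, column delta).
MOVES: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def split_into_views(width: int, height: int, rows: int = 2, cols: int = 2) -> List[Box]:
    """Split an image into rows * cols non-overlapping views, in row-major order."""
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((
                c * width // cols,
                r * height // rows,
                (c + 1) * width // cols,
                (r + 1) * height // rows,
            ))
    return boxes


def shift(view_index: int, direction: str, rows: int = 2, cols: int = 2) -> int:
    """Move the perceptual field to a neighbouring view; stay put at the image border."""
    r, c = divmod(view_index, cols)
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    if 0 <= nr < rows and 0 <= nc < cols:
        return nr * cols + nc
    return view_index


def zoom(boxes: List[Box], selected: List[int]) -> List[Box]:
    """Return the crop boxes of the views chosen for closer inspection."""
    return [boxes[i] for i in selected]


views = split_into_views(640, 480)   # four views of a 640x480 image
current = 0                          # start from the top-left view only
current = shift(current, "right")    # shifting: request the neighbouring view
detail = zoom(views, [1, 3])         # zooming: pick two regions to inspect
```

In this sketch a model that starts at view 0 sees only the top-left quarter of the image, so any question whose evidence lies elsewhere cannot be answered without a shift or zoom request.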

Paper Structure

This paper contains 55 sections, 3 equations, 7 figures, and 17 tables.

Figures (7)

  • Figure 1: Active perception allows humans or models to perform more complex tasks by actively seeking and processing relevant information. In this paper, we evaluate two key active perception abilities for MLLMs: 1) shifting, as real-world scenarios often present limited views and require shifts to obtain new perspectives, and 2) zooming, which helps enhance perception by zooming out for a broader view and zooming in for details.
  • Figure 2: Examples of ActiView, exhibiting the following features: i) requiring focusing on multiple fine-grained regions; ii) requiring distinguishing distracting information from the entire image; iii) requiring moving of perceptual fields to obtain sufficient visual information to answer questions. During evaluation, models will be given an initial view cropped from the original image as shown above. Visual Information: human-annotated visual clues.
  • Figure 3: The statistical distribution of our benchmark.
  • Figure 4: Evaluation pipelines as described in the evaluation pipeline section. (a) Zooming requires models to select multiple regions to zoom in on. It tests one of the two fundamental active perception abilities. (b) Shifting challenges models to ask for additional necessary information. It tests the other fundamental active perception ability. (c) Mixed simulates human behavior when shifting perceptual fields to find missing information. It is more flexible and more applicable to real life than the two fundamental pipelines. Note that while the figure shows an example in which the model deletes a zoomed sub-view, the deletion behavior is NOT required. It is shown to highlight the compound nature of the mixed pipeline (c) compared to the fundamental pipelines (a) and (b).
  • Figure 5: Cases for each evaluation pipeline. (a) A successful zooming case, (b) a failed shifting case, and (c) a mixed case that successfully corrects the wrong answer produced in (b). Model-selected views for cases (a) and (b) are placed to the right of the example frames, and the views used in case (c) are shown within its frame, since the selection of views changes during the evaluation process.
  • ...and 2 more figures