Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations

Pengzhou Cheng, Lingzhong Dong, Zeng Wu, Zongru Wu, Xiangru Tang, Chengwei Qin, Zhuosheng Zhang, Gongshen Liu

TL;DR

Agent-ScanKit introduces a perturbation-based framework to disentangle memory and reasoning in multimodal GUI agents without access to model internals. It applies visual-, text-, and structure-guided perturbations to measure each agent's sensitivity and compares 18 agents on five GUI benchmarks, revealing a predominance of memorization over genuine reasoning. The findings indicate that current models largely function as retrievers of training-aligned knowledge, with substantial generalization gaps on long-horizon or cross-platform tasks, though RL with chain-of-thought improves language-side reasoning. These insights and the accompanying toolkit offer a principled path to diagnosing and strengthening reasoning mechanisms in practical multimodal agents.

Abstract

Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUIs), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms (visual-guided, text-guided, and structure-guided), each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. Across five publicly available GUI benchmarks and 18 multimodal agents, the results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.

Paper Structure

This paper contains 32 sections, 8 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Pipeline of existing multimodal agents for GUI tasks. However, their poor reliability may stem from reliance on memorization rather than genuine reasoning.
  • Figure 2: Comparative performance of 7B to 8B multimodal agents on two evaluation metrics in GUI tasks. RL-based models are highlighted in red, while reasoning-enabled models are marked with "*". The low-level setting provides atomic instructions derived from the query, whereas the high-level setting provides only the query.
  • Figure 3: Task success rates for multimodal agents across five datasets. Models on the x-axis are grouped by training paradigm. The y-axis lists datasets, with each dataset's average interaction length (Avg Steps) in parentheses. "*" denotes models providing CoT for action reasoning.
  • Figure 4: Overview of the Agent-ScanKit framework. The framework systematically probes multimodal agents with controlled perturbations along visual, textual, and structural dimensions, revealing the interplay between memory and genuine reasoning.
  • Figure 5: Distribution of memory and reasoning across multimodal agents of different scales.
  • ...and 5 more figures