Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations
Pengzhou Cheng, Lingzhong Dong, Zeng Wu, Zongru Wu, Xiangru Tang, Chengwei Qin, Zhuosheng Zhang, Gongshen Liu
TL;DR
Agent-ScanKit introduces a perturbation-based framework to disentangle memory and reasoning in multimodal GUI agents without internal access. It applies visual-, text-, and structure-guided perturbations to measure perturbation sensitivity and compares performance across 18 agents on five GUI benchmarks, revealing a predominance of memorization over genuine reasoning. The findings indicate that current models largely function as retrievers of training-aligned knowledge, with substantial generalization gaps on long-horizon or cross-platform tasks, though RL with chain-of-thought improves language-side reasoning. These insights and the accompanying toolkit offer a principled path to diagnosing and strengthening reasoning mechanisms in practical multimodal agents.
Abstract
Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUIs), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. Across five publicly available GUI benchmarks and 18 multimodal agents, the results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.
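To make the idea of black-box perturbation probing concrete, the following is a minimal sketch, not the Agent-ScanKit implementation. It assumes an agent can be treated as a function from an instruction string to an action string, and it scores sensitivity as the accuracy change between original and perturbed inputs. The names `accuracy`, `sensitivity`, `lookup_agent`, `keyword_agent`, and the paraphrase perturbation are all hypothetical; the abstract does not specify the actual metric or perturbation operators.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types: an agent maps an instruction to a predicted action.
Sample = Tuple[str, str]          # (instruction, expected action)
Agent = Callable[[str], str]
Perturbation = Callable[[str], str]


def accuracy(agent: Agent, samples: Sequence[Sample]) -> float:
    """Fraction of samples where the agent's action matches the expected one."""
    if not samples:
        raise ValueError("samples must be non-empty")
    hits = sum(1 for instruction, expected in samples if agent(instruction) == expected)
    return hits / len(samples)


def sensitivity(agent: Agent, samples: Sequence[Sample], perturb: Perturbation) -> float:
    """Accuracy on original inputs minus accuracy on perturbed inputs.

    Only model outputs are used, so no access to model internals is needed.
    """
    perturbed = [(perturb(instruction), expected) for instruction, expected in samples]
    return accuracy(agent, samples) - accuracy(agent, perturbed)


# Toy data and a toy text-guided perturbation (assumed: a meaning-preserving paraphrase).
SAMPLES = [("open settings", "CLICK settings"), ("open camera", "CLICK camera")]


def paraphrase(instruction: str) -> str:
    return instruction.replace("open", "launch")


# A pure lookup table: answers only instructions it has seen verbatim.
_MEMORY = dict(SAMPLES)


def lookup_agent(instruction: str) -> str:
    return _MEMORY.get(instruction, "UNKNOWN")


# A rule-based agent that acts on the target word regardless of the verb used.
def keyword_agent(instruction: str) -> str:
    return "CLICK " + instruction.split()[-1]


lookup_score = sensitivity(lookup_agent, SAMPLES, paraphrase)
keyword_score = sensitivity(keyword_agent, SAMPLES, paraphrase)
```

In this toy setup the lookup agent fails on every paraphrased instruction while the rule-based agent is unaffected. How a score difference should be read depends on whether the perturbation preserves the task's semantics, which is a design decision of the probing protocol and is not detailed in the abstract.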
