Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models

Sichu Liang; Hongyu Zhu; Wenwen Wang; Deyu Zhou

TL;DR

This study asks whether vision can substitute for text as a memory substrate in multimodal language models, using a controlled spatial n-back task presented in matched text-grid and vision-grid formats. With Qwen2.5-7B-Instruct and Qwen2.5-VL-7B-Instruct, the authors measure performance using standard working-memory (WM) metrics and a trial-level evidence score derived from log-probabilities, supplemented by a lag-scan diagnostic and a grid-size manipulation. They find a robust modality gap (text-grid > vision-grid), with performance deteriorating as memory load increases, and show that many effects arise from recency-based strategies and recent-repeat interference rather than pure memory-capacity limits. The pattern holds across model families and scales, supporting the conclusion that the representational code materially shapes WM-like computations. This challenges the assumption that text-to-vision substitution preserves the underlying cognitive processes and calls for computation-focused multimodal evaluation.

Abstract

Working memory is a central component of intelligent behavior, providing a dynamic workspace for maintaining and updating task-relevant information. Recent work has used n-back tasks to probe working-memory-like behavior in large language models, but it is unclear whether the same probe elicits comparable computations when information is carried in a visual rather than textual code in vision-language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids. Across conditions, models show reliably higher accuracy and d' with text than with vision. To interpret these differences at the process level, we use trial-wise log-probability evidence and find that nominal 2/3-back often fails to reflect the instructed lag and instead aligns with a recency-locked comparison. We further show that grid size alters recent-repeat structure in the stimulus stream, thereby changing interference and error patterns. These results motivate computation-sensitive interpretations of multimodal working memory.
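As a rough illustration of the behavioral measures named in the abstract, the sketch below labels a spatial n-back stream and computes $d'$ from hit and false-alarm rates. It is a minimal sketch, not the authors' code: the function names `nback_labels` and `d_prime`, the `(row, col)` location encoding, and the log-linear correction for extreme rates are all assumptions made for illustration.

```python
from statistics import NormalDist


def nback_labels(locations, n):
    """Label each trial as a match (True) or non-match (False).

    A trial is a match if its location equals the location n steps back.
    The first n trials have no n-back reference and are labeled None.
    """
    return [
        None if t < n else locations[t] == locations[t - n]
        for t in range(len(locations))
    ]


def d_prime(responses, labels):
    """Compute d' = z(H) - z(FA) from boolean responses (True = "match").

    Trials whose label is None are skipped. The log-linear correction
    (add 0.5 to counts, 1 to totals) is an assumption of this sketch;
    it keeps z() finite when a rate would otherwise be 0 or 1.
    """
    hits = false_alarms = n_match = n_nonmatch = 0
    for response, label in zip(responses, labels):
        if label is None:
            continue
        if label:
            n_match += 1
            hits += bool(response)
        else:
            n_nonmatch += 1
            false_alarms += bool(response)
    hit_rate = (hits + 0.5) / (n_match + 1)
    fa_rate = (false_alarms + 0.5) / (n_nonmatch + 1)
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical 2-back stream on a small grid, locations as (row, col).
stream = [(0, 0), (1, 1), (0, 0), (1, 1), (2, 2), (1, 1)]
labels = nback_labels(stream, n=2)
print(labels)
```

With balanced match and non-match counts, a model that always answers non-match gets $d' = 0$ under this correction, which is the chance-level reference used in the figures.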

Paper Structure (25 sections, 1 equation, 5 figures, 1 table)

Introduction
Method
Models as Participants
Task, Stimuli, and Design
Text-grid condition.
Vision-grid condition.
Stimulus readability check (location recognition).
Trial sequence generation.
Procedure
Dependent Measures
Behavioral measures (Accuracy, $H/FA$, and $d'$).
Evidence-based discriminability (AUC).
Results
Main Results
Modality effect.
...and 10 more sections

Figures (5)

Figure 1: Spatial n-back task and multimodal presentation. (a) A spatial n-back block consists of a temporal sequence of grids; the target cell is marked by X. A trial is labeled as Match if the current location matches the location $n$ steps back. (b) The same trial stream is presented in two formats: a text-rendered grid (text modality) and an image-rendered grid (vision modality), with trial-wise decision evidence extracted from the forced-choice responses.
Figure 2: Main results on spatial n-back. Hit rate, false-alarm rate, accuracy, and $d'$ versus grid size $N$ for $n{=}1,2,3$ (rows). Curves compare Qwen2.5-7B (text-grid), Qwen2.5-VL-7B (text-grid), and Qwen2.5-VL-7B (vision-grid); bands show $\pm1$ SEM (standard error of the mean) across blocks. Dashed lines mark chance-level references (50% for rates/accuracy; $d'{=}0$).
Figure 3: Robustness across model families and scale. Sensitivity ($d'$) versus grid size $N$ for $n{=}1,2,3$ (rows) in two LLM/VLM pairs (left: Llama3.1-8B/Llama3.2-11B-Vision; right: Qwen2.5-32B/Qwen2.5-VL-32B). For each pair, we compare LLM(text-grid), VLM(text-grid), and VLM(vision-grid); bands show $\pm1$ SEM across blocks.
Figure 4: Lag-scan diagnostic. Using the evidence $s_t$, we compute AUC against match/non-match labels defined by assumed lags $k\in\{1,2,3\}$ for each nominal instruction $n$. AUC at the instructed definition ($k{=}n$, shaded) drops toward chance for $n{=}2,3$. Across $n$, AUC peaks at $k{=}1$ rather than $k{=}n$.
Figure 5: Explaining the grid-size advantage via recent-repeat interference (in 1-back). (A) Recent-repeat lure rate decreases with grid size $N$ (in all $W\in\{3,5,7\}$). (B) Lure non-matches: discriminability (AUC from $s_t$) is lower when contrasting matches against lure non-matches than against novel non-matches for $N\ge4$, indicating that recent repeats make non-match trials more confusable. (C) Lure matches: in small grids, match trials in a recent-repeat context receive lower (often negative) evidence $s_t$, indicating suppressed match evidence under dense repetition. (D) Small grids show a block-level collapse toward always predicting non-match (match-response rate distribution; $3{\times}3$ median $=0$). Bands/error bars show $\pm1$ SEM across blocks; dashed lines mark 50% for AUC (%) and match response rate.
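The lag-scan diagnostic described in the Figure 4 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes the evidence $s_t$ is a scalar per trial where larger values favor a match response, and the helper names `auc` and `lag_scan` are hypothetical. AUC is computed as the probability that a randomly chosen match trial receives higher evidence than a randomly chosen non-match trial, and is returned as a fraction rather than a percentage.

```python
def auc(scores, labels):
    """Rank-based AUC: P(score of a match trial > score of a non-match trial).

    Ties count as 0.5. Returns NaN if either class is empty.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def lag_scan(locations, evidence, lags=(1, 2, 3)):
    """Score the same evidence against labels defined by each assumed lag k.

    To keep the comparison fair, every lag is evaluated on the same trials:
    those with at least max(lags) predecessors.
    """
    start = max(lags)
    trials = range(start, len(locations))
    scores = [evidence[t] for t in trials]
    result = {}
    for k in lags:
        labels = [locations[t] == locations[t - k] for t in trials]
        result[k] = auc(scores, labels)
    return result


# Hypothetical stream whose evidence tracks the 1-back comparison only,
# mimicking a recency-locked strategy regardless of the instructed n.
locations = [0, 0, 1, 2, 1, 1, 3, 2, 3, 3]
evidence = [0.0] + [
    1.0 if locations[t] == locations[t - 1] else -1.0
    for t in range(1, len(locations))
]
print(lag_scan(locations, evidence))
```

If the AUC at $k{=}1$ exceeds the AUC at the instructed lag $k{=}n$, the evidence is better explained by a recency-locked comparison than by the nominal n-back rule, which is the pattern the caption reports for $n{=}2,3$.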
