The Interplay of Attention and Memory in Visual Enumeration
B. Sankar, Devottama Sen, Dibakar Sen
TL;DR
The paper investigates how attention and memory interact during visual enumeration across large, realistic visual fields using an immersive VR setup (RAVEN-VR). By comparing two phases, one with abstract shapes and one with real-world object images, the study shows that task intent (counting all items, selectively including some, or selectively excluding some) is the dominant determinant of performance: selective filtering substantially increases the time taken, reduces accuracy, and elevates cognitive load. Semantic processing of real-world objects amplifies these costs and can severely suppress memory recall, an effect larger than that of the spatial layout of items. Across phases, gaze patterns reveal that attention is guided primarily by top-down task demands, while the environment’s structure exerts a secondary influence. Memory encoding suffers when cognitive demands are high, illustrating a tight coupling between attention and memory under cognitive load. Overall, the work demonstrates that real-world enumeration is constrained by semantic processing demands as much as by visual search, and that VR enables principled, scalable study of these dynamics in ecologically valid contexts.
Abstract
Humans navigate and understand complex visual environments by subconsciously quantifying what they see, a process known as visual enumeration. However, traditional studies using flat screens fail to capture the cognitive dynamics of this process over the large visual fields of real-world scenes. To address this gap, we developed an immersive virtual reality system with integrated eye-tracking to investigate the interplay between attention and memory during complex enumeration. We conducted a two-phase experiment in which participants enumerated scenes of either simple abstract shapes or complex real-world objects, while we systematically varied the task intent (e.g., selective vs. exhaustive counting) and the spatial layout of items. Our results reveal that task intent is the dominant factor driving performance, with selective counting imposing a significant cognitive cost that was dramatically amplified by stimulus complexity. The semantic processing required for real-world objects reduced accuracy and suppressed memory recall, whereas the influence of spatial layout was secondary and statistically non-significant when a higher-order task intent was driving behaviour. We conclude that real-world enumeration is fundamentally constrained by the cognitive load of semantic processing, not just the mechanics of visual search. Our findings demonstrate that under high cognitive demand, the effort to understand what we are seeing directly limits our capacity to remember it.
