The Interplay of Attention and Memory in Visual Enumeration

B. Sankar, Devottama Sen, Dibakar Sen

TL;DR

The paper investigates how attention and memory interact during visual enumeration across large, realistic visual fields using an immersive VR setup (RAVEN-VR). By comparing two phases—abstract shapes and real-world object images—the study shows that task intent (whether counting all, selectively including, or selectively excluding) is the dominant determinant of performance, with selective filtering substantially increasing time, reducing accuracy, and elevating cognitive load. Semantic processing of real-world objects amplifies these costs and can severely suppress memory recall, more so than the spatial layout of items. Across phases, gaze patterns reveal that attention is guided by top-down task demands, while the environment’s structure exerts a secondary influence; memory encoding suffers when cognitive demands are high, illustrating a tight coupling between attention and memory under cognitive load. Overall, the work demonstrates that real-world enumeration is constrained by semantic processing demands as much as by visual search, and VR enables principled, scalable study of these dynamics in ecologically valid contexts.

Abstract

Humans navigate and understand complex visual environments by subconsciously quantifying what they see, a process known as visual enumeration. However, traditional studies using flat screens fail to capture the cognitive dynamics of this process over the large visual fields of real-world scenes. To address this gap, we developed an immersive virtual reality system with integrated eye-tracking to investigate the interplay between attention and memory during complex enumeration. We conducted a two-phase experiment where participants enumerated scenes of either simple abstract shapes or complex real-world objects, systematically varying the task intent (e.g., selective vs. exhaustive counting) and the spatial layout of items. Our results reveal that task intent is the dominant factor driving performance, with selective counting imposing a significant cognitive cost that was dramatically amplified by stimulus complexity. The semantic processing required for real-world objects reduced accuracy and suppressed memory recall, while the influence of spatial layout was secondary and statistically non-significant when higher-order task intent was driving behaviour. We conclude that real-world enumeration is fundamentally constrained by the cognitive load of semantic processing, not just the mechanics of visual search. Our findings demonstrate that under high cognitive demand, the effort to understand what we are seeing directly limits our capacity to remember it.

Paper Structure

This paper contains 83 sections, 22 figures.

Figures (22)

  • Figure 1: A typical real-world scene, a formal garden, which humans subconsciously quantify to build a mental model. Describing this scene naturally involves enumerating key features like fountains, benches, and statues.
  • Figure 2: Comparison of subitizing and counting across different types of visual stimuli, demonstrating the transition from rapid enumeration to effortful counting as complexity increases.
  • Figure 3: A diagram of the human visual field, illustrating the distinction between the narrow, high-acuity central vision (inner cones) and the broader, low-resolution peripheral vision (outer arcs). Traditional studies often confine stimuli to the central field.
  • Figure 4: Overview of the RAVEN-VR experimental setup. The left column (a, c, e) shows the three spatial layouts for the Phase 1 task with abstract dot stimuli. The right column (b, d, f) shows the same layouts for the Phase 2 task with complex real-world images.
  • Figure 5: Distributions of Task Completion Time in Phase 1, analyzed by (a) Instruction Type and (b) Spatial Layout.
  • ...and 17 more figures