
Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual Foraging

Bo Wang, Dingwei Tan, Yen-Ling Kuo, Zhaowei Sun, Jeremy M. Wolfe, Tat-Jen Cham, Mengmi Zhang

TL;DR

We develop a transformer-based Visual Forager model, trained via reinforcement learning, that outperforms all baselines, achieves cumulative rewards comparable to those of humans, and approximates human foraging behavior in eye movements and foraging biases within time-limited environments.

Abstract

Imagine searching a collection of coins for quarters ($0.25), dimes ($0.10), nickels ($0.05), and pennies ($0.01). This is a hybrid foraging task, in which observers look for multiple instances of multiple target types. In such tasks, how do target values and their prevalence influence foraging and eye movement behaviors (e.g., should you prioritize rare quarters or common nickels)? To explore this, we conducted human psychophysics experiments, which revealed that humans are proficient reward foragers: their eye fixations are drawn to regions with higher average rewards, their fixation durations are longer on more valuable targets, and their cumulative rewards exceed chance, approaching the upper bound set by optimal foragers. To probe these human decision-making processes, we developed a transformer-based Visual Forager (VF) model trained via reinforcement learning. Our VF model takes a series of targets, their corresponding values, and the search image as inputs, processes the image using foveated vision, and produces a sequence of eye movements along with a decision on whether to collect each fixated item. Our model outperforms all baselines, achieves cumulative rewards comparable to those of humans, and approximates human foraging behavior in eye movements and foraging biases within time-limited environments. Furthermore, stress tests on out-of-distribution tasks with novel targets, unseen values, and varying set sizes demonstrate that the VF model generalizes effectively. Our work offers valuable insights into the relationship between eye movements and decision-making, and our model serves as a powerful tool for further exploring this connection. All data, code, and models are available at https://github.com/ZhangLab-DeepNeuroCogLab/visual-forager.
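The abstract's question (rare quarters or common nickels?) has a simple answer for an idealized forager. The sketch below assumes a forager that already knows where every coin is, pays only for clicks (search time is ignored), and is limited to a fixed click budget, like the 20-click cap described in Figure 2. Under those assumptions, collecting the highest-valued items first maximizes reward. The function `oracle_reward` and the cent-valued table `COIN_VALUES_CENTS` are hypothetical illustrations, not the paper's definition of its upper bound.

```python
# Hypothetical value table for the abstract's coin example, in cents
# (integers avoid floating-point rounding when summing rewards).
COIN_VALUES_CENTS = {"quarter": 25, "dime": 10, "nickel": 5, "penny": 1}


def oracle_reward(display, values, click_budget):
    """Upper-bound reward for an idealized forager (hypothetical helper).

    Assumes the forager knows every item's location and value and that only
    clicks are limited, so search time is ignored.

    display: iterable of item labels present on screen.
    values: mapping label -> value; labels not in it are distractors.
    click_budget: maximum number of clicks allowed.
    """
    # Keep only targets; an idealized forager never clicks a distractor.
    target_values = sorted(
        (values[item] for item in display if item in values), reverse=True
    )
    # With equal cost per click, taking the most valuable targets first
    # yields the largest total for any fixed number of clicks.
    return sum(target_values[:click_budget])


display = ["quarter"] * 2 + ["nickel"] * 10 + ["penny"] * 8 + ["button"] * 5
print(oracle_reward(display, COIN_VALUES_CENTS, click_budget=5))
```

With two quarters, ten nickels, eight pennies, and five distractor buttons on screen and a 5-click budget, the bound is 65 cents (two quarters plus three nickels), versus 25 cents for a nickels-only strategy. Real foragers must also spend time finding rare targets, which is what makes the question non-trivial under a time limit.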


Paper Structure

This paper contains 38 sections, 2 equations, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: Illustrative example of eye movements and decision-making in a hybrid visual foraging task. The image depicts a real-world scenario in which the goal is to search piles of coins for multiple instances of target coins with varying monetary values, to maximize the cumulative monetary reward within a time-limited environment. Yellow dots and arrows represent the locations and order of eye movements during the search. Red bounding boxes show the target coins that are collected. Note that humans do not always collect every item they fixate on, highlighting the selective nature of the foraging process.
  • Figure 2: Schematic of the hybrid visual foraging experiment. Each foraging trial starts with a 2-second center fixation (omitted here for simplicity), followed by the presentation of target images and their associated values (e.g., a plant valued at 4). To ensure that human participants have memorized the targets and their values, they must pass a recognition test by selecting all targets among distractors and correctly matching their values. If they make errors, they repeat the target and value presentation phases. After another 2-second center fixation, an object array is displayed. Both human and AI agents are tasked with collecting as many targets as possible through mouse clicks to maximize their total reward, where rewards correspond to the values of the target objects and a penalty of -1 is incurred for clicking on a distractor. The trial ends either after 30 seconds or after 20 clicks, whichever comes first.
  • Figure 3: Architecture overview of our Visual Forager. VF consists of three modules, each elaborated in the model section: (1) visual feature modulation from target images, with foveated vision mimicking eccentricity-dependent sampling in human vision; (2) modulation from the values of the different targets; and (3) a decision-making process with an actor-critic transformer architecture, which outputs the next fixation location from predicted attention maps and the probability of clicking the currently fixated item.
  • Figure 4: Humans and AI models are reward-seeking agents. We report the normalized score (Norm. Score) as a function of click number for humans (red), our VF model (blue), and other baseline models (shades of gray). Chance is shown in black. Three experimental conditions of foraging trials are included, with varying prevalence and values of target objects. See the baselines section for evaluation metrics and baselines, and the experimental-conditions section for the conditions.
  • Figure 5: (A) Our VF model shows clicking biases consistent with those of humans. Humans (red) and our VF model (blue) share the same CBR sign for most targets under UnValEqPre (a) and UnValUnPre (b). Chance (gray) has no preference among target objects and hence has a CBR of 0. (B) Our VF model approximates human saccade size distributions. Saccade size distributions are shown for humans (red), our VF model (blue), and our VF model with eccentricity removed (light blue). Colored vertical dashed lines indicate each group's mean saccade size in degrees of visual angle. (C) Our VF model generalizes to out-of-distribution hybrid foraging tasks. The spider plot shows the Norm. Score for humans (red), our VF model (blue), and the FeatOnly baseline (black dotted) under 7 experimental conditions.
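The scoring rule in Figure 2 (a target's value per target click, -1 per distractor click, and a trial that ends after 30 seconds or 20 clicks) can be sketched as follows. This is a minimal illustration, assuming that each click lands on a distinct object and that a click at exactly 30 seconds still counts; `Click` and `score_trial` are hypothetical names, not part of the released code.

```python
from dataclasses import dataclass

MAX_CLICKS = 20          # trial ends after 20 clicks (Figure 2)
TIME_LIMIT_S = 30.0      # ...or after 30 seconds (Figure 2)
DISTRACTOR_PENALTY = -1  # clicking a distractor costs 1 point (Figure 2)


@dataclass
class Click:
    """One mouse click (hypothetical record type)."""
    time_s: float  # seconds since the object array appeared
    item: str      # label of the clicked object


def score_trial(clicks, target_values):
    """Total reward of one trial under Figure 2's rules (hypothetical helper).

    clicks: Click records in chronological order; each click is assumed
        to land on a different object.
    target_values: mapping from target label to its value; any other label
        is treated as a distractor.
    """
    total = 0
    for n, click in enumerate(clicks):
        # Stop at the 21st click or at the first click after the time limit
        # (assumption: a click at exactly 30 s is still counted).
        if n >= MAX_CLICKS or click.time_s > TIME_LIMIT_S:
            break
        total += target_values.get(click.item, DISTRACTOR_PENALTY)
    return total


values = {"plant": 4, "cup": 2}
trial = [Click(1.0, "plant"), Click(2.0, "sock"), Click(3.0, "cup")]
print(score_trial(trial, values))
```

In this example the plant (4) and cup (2) are collected and the sock is a distractor, so the trial scores 4 - 1 + 2 = 5.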
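Figure 3 describes foveated vision that mimics eccentricity-dependent sampling: detail is highest at the fixation and degrades toward the periphery. One generic way to approximate this, assumed here and not necessarily what VF does, is to precompute progressively blurred copies of the image and, for each pixel, pick the copy that matches its distance from the fixation. The function names and the `px_per_level` and `n_levels` parameters are hypothetical; the sketch works on a grayscale NumPy array, whereas the model may apply foveation to feature maps instead.

```python
import numpy as np


def box_blur(img):
    """3x3 mean filter with edge padding on a 2-D (H x W) array."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    # Sum the 9 shifted copies of the image, then average.
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0


def foveate(img, fixation, px_per_level=20.0, n_levels=4):
    """Blur a grayscale image more strongly with distance from the fixation.

    fixation: (row, col) of the current fixation, in pixels.
    px_per_level: eccentricity, in pixels, covered by each blur level
        (hypothetical parameter).
    n_levels: number of blur levels; level 0 is the unblurred image.
    """
    img = np.asarray(img, dtype=float)
    # Blur stack: each level is one more 3x3 box blur of the previous one.
    levels = [img]
    for _ in range(n_levels - 1):
        levels.append(box_blur(levels[-1]))
    stack = np.stack(levels)  # shape (n_levels, H, W)

    # Eccentricity of every pixel, measured from the fixation in pixels.
    rows, cols = np.indices(img.shape)
    ecc = np.hypot(rows - fixation[0], cols - fixation[1])
    level_idx = np.minimum((ecc // px_per_level).astype(int), n_levels - 1)

    # Select, per pixel, the blur level that matches its eccentricity.
    return np.take_along_axis(stack, level_idx[None], axis=0)[0]


checkerboard = np.indices((64, 64)).sum(axis=0) % 2
retina = foveate(checkerboard, fixation=(32, 32))
```

Pixels within `px_per_level` of the fixation are passed through unchanged, while the fine checkerboard pattern is averaged away in the periphery. A blur stack is a cheap, common approximation; a faithful model of human acuity would map eccentricity in degrees of visual angle to resolution.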
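The Figure 4 caption plots a normalized score against click number but does not define the normalization. Because the abstract compares human rewards with chance and with the upper bound of optimal foragers, one plausible choice, assumed here, is to rescale the cumulative reward after each click so that chance maps to 0 and the upper bound maps to 1. The helper `normalized_scores` is hypothetical, and the paper's exact definition may differ.

```python
def normalized_scores(rewards, chance_curve, upper_curve):
    """Normalize a forager's cumulative reward after each click (assumed
    min-max normalization between chance and an upper bound).

    rewards: per-click rewards of one trial, in click order.
    chance_curve, upper_curve: expected cumulative reward of a chance forager
        and of an optimal forager after 1, 2, ... clicks (same length as
        rewards).
    Returns one score per click: 0 means chance level, 1 means the upper bound.
    """
    scores = []
    cumulative = 0.0
    for k, reward in enumerate(rewards):
        cumulative += reward
        span = upper_curve[k] - chance_curve[k]
        # If chance already equals the upper bound, the score is undefined;
        # report 0.0 here (an arbitrary convention for this sketch).
        scores.append((cumulative - chance_curve[k]) / span if span else 0.0)
    return scores


print(normalized_scores([4, 2, -1], chance_curve=[0, 0, 0], upper_curve=[4, 8, 12]))
```

Normalizing per click number, rather than only at the end of the trial, lets curves from trials with different value and prevalence conditions be compared on one axis, as in Figure 4.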
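Figure 5B reports saccade sizes in degrees of visual angle. Converting on-screen fixation coordinates into saccade amplitudes requires the display resolution and the viewing distance, which the captions do not give. The sketch below treats those as inputs (hypothetical values in the usage line) and assumes the eye sits on the line perpendicular to the screen through `screen_center_px`. It then computes the exact angle between the two lines of sight for each pair of consecutive fixations.

```python
import math


def saccade_sizes_deg(fixations_px, screen_center_px, px_per_cm, distance_cm):
    """Saccade amplitudes, in degrees of visual angle, for a fixation sequence.

    fixations_px: list of (x, y) fixation positions in pixels.
    screen_center_px: (x, y) pixel straight ahead of the eye (assumed).
    px_per_cm: display resolution in pixels per centimeter.
    distance_cm: eye-to-screen distance in centimeters.
    """
    def gaze_vector(p):
        # 3-D line of sight from the eye to the screen point, in cm.
        x = (p[0] - screen_center_px[0]) / px_per_cm
        y = (p[1] - screen_center_px[1]) / px_per_cm
        return (x, y, distance_cm)

    sizes = []
    for a, b in zip(fixations_px, fixations_px[1:]):
        u, v = gaze_vector(a), gaze_vector(b)
        dot = sum(i * j for i, j in zip(u, v))
        cross = (
            u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0],
        )
        # atan2(|u x v|, u . v) is the angle between the two lines of sight.
        sizes.append(math.degrees(math.atan2(math.hypot(*cross), dot)))
    return sizes


# Hypothetical setup: 1920x1080 display, 38 px/cm, viewed from 60 cm.
amplitudes = saccade_sizes_deg([(960, 540), (1200, 540), (700, 300)],
                               (960, 540), px_per_cm=38.0, distance_cm=60.0)
```

The mean of these amplitudes (for example, with `statistics.mean`) corresponds to the dashed lines in Figure 5B. The `atan2` form stays numerically accurate for both very small and very large angles, unlike `acos` of a normalized dot product.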