A computational approach to visual ecology with deep reinforcement learning

Sacha Sokoloski, Jure Majnik, Philipp Berens

TL;DR

The paper introduces a deep reinforcement learning framework to study visual ecology by framing animal survival as the sole objective in a ViZDoom-based foraging task. It shows that the complexity of the agent's vision model must scale with the visual complexity of food, and that recurrent architectures are crucial to exploiting complex visual inputs on demanding tasks. The authors demonstrate that different brain architectures produce distinct representations of value and behavior, with satiety signals further shaping strategies and reducing nutritional waste. This work provides a computational platform and benchmarks for investigating how perception and value emerge under survival-driven objectives, offering insights into neural coding in visually rich ecological niches.

Abstract

Animal vision is thought to optimize various objectives from metabolic efficiency to discrimination performance, yet its ultimate objective is to facilitate the survival of the animal within its ecological niche. However, modeling animal behavior in complex environments has been challenging. To study how environments shape and constrain visual processing, we developed a deep reinforcement learning framework in which an agent moves through a 3-d environment that it perceives through a vision model, where its only goal is to survive. Within this framework we developed a foraging task where the agent must gather food that sustains it, and avoid food that harms it. We first established that the complexity of the vision model required for survival on this task scaled with the variety and visual complexity of the food in the environment. Moreover, we showed that a recurrent network architecture was necessary to fully exploit complex vision models on the most visually demanding tasks. Finally, we showed how different network architectures learned distinct representations of the environment and task, and led the agent to exhibit distinct behavioural strategies. In summary, this paper lays the foundation for a computational approach to visual ecology, provides extensive benchmarks for future work, and demonstrates how representations and behaviour emerge from an agent's drive for survival.

Paper Structure (11 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Overview of the computational framework. a: A depiction (top) of the agent (mouse) in an environment (green field), surrounded by nourishment (red dots) and poison (blue dots). The agent perceives the environment through its viewport (bottom). b: The brain model of the agent processes the viewport with a CNN modelled after early animal vision (bottom). The output of the vision model and the satiety signal feed into the FC layers, which can also be modulated by input satiety and which ultimately output into a GRU. The policy and estimated value function (top) are linear functions of the latent GRU state. c: Task complexity varies with visual complexity, from food represented by apples to CIFAR-10 images.
  • Figure 2: Benchmarking vision models and brain architectures. a: Sample viewports from the apples (red frame), Gabors (blue frame), MNIST (green frame), and CIFAR-10 (purple frame) tasks. b: Median (points) and min-max range (filled area) of 3 training histories of an RNN on each task. Each training history is composed of 10,0000 samples and smoothed with a sliding Gaussian window of size 51. c: Average lifespan of the linear (squares), FF (filled triangles), FF-IS (empty triangles), RNN (filled circles), and RNN-IS (empty circles) brains. Lifespan is computed as the average over the last 500 steps of the smoothed training histories. d: Lifespan of FF (triangles) and RNN (circles) agents as a function of base channels $n_{BC}$. e: Lifespan of the FF and RNN agents as a function of LGN size $n_{LGN}$. f: Lifespan of the FF and RNN agents as a function of latent space size $n_{FC}$.
  • Figure 3: Characterizing the discrimination performance of different architectures. Median frequency of pickups of poison and nourishment for the FF (light blue), FF-IS (blue), RNN (purple), and RNN-IS (magenta) agents on a: the apples task, b: the Gabors task, c: the MNIST task, and d: the CIFAR-10 task.
  • Figure 4: Qualitative analyses of the estimated value function. Each row visualizes a 1000-frame simulation of a trained agent on the MNIST task. a: Sample viewports from trained models with the FF, FF-IS, RNN, and RNN-IS architectures. b: Sensitivity of $\hat{V}$ of each architecture to the corresponding viewports in a, based on the method of integrated gradients. c: The satiety (gold line) and $\hat{V}$ (purple line) of the agent over time. The dashed gold lines indicate a task reset, and the dashed black line indicates the time of the corresponding viewport shown in a.
  • Figure 5: Regression analysis of the estimated value function. 10-fold cross-validated $r^2$ for the linear prediction of $\hat{V}$ given distinct regressors. Colours indicate task, points indicate median $r^2$, filled areas indicate min-max $r^2$, and bounding lines indicate performance upper bounds based on the estimated intrinsic noise of $\hat{V}$. a: Regression of $\hat{V}$ at time $t$ on agent satiety at time $t$. b: Regression of $\hat{V}$ on the time in frames (countdown) until the next food is consumed. c: Regression of $\hat{V}$ on both satiety and food countdown.
  • ...and 1 more figure
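
The smoothing and lifespan summary described in the Figure 2 caption can be sketched roughly as follows. This is a minimal NumPy sketch, not the authors' code. The caption gives the window size (51), the number of training histories (3), and the averaging range (last 500 steps). The Gaussian width `sigma`, the edge padding, and every function name here are assumptions made for illustration.

```python
import numpy as np

def gaussian_window(size=51, sigma=None):
    """Normalized Gaussian weights over `size` taps."""
    if sigma is None:
        sigma = size / 6.0  # ASSUMPTION: the caption does not state the width
    offsets = np.arange(size) - (size - 1) / 2.0
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    return weights / weights.sum()

def smooth_history(history, size=51, sigma=None):
    """Smooth a 1-D training history with a sliding Gaussian window."""
    history = np.asarray(history, dtype=float)
    window = gaussian_window(size, sigma)
    # ASSUMPTION: repeat the edge values so the output keeps the input length.
    padded = np.pad(history, size // 2, mode="edge")
    return np.convolve(padded, window, mode="valid")

def summarize_runs(runs, size=51):
    """Median and min-max range across runs (rows) of smoothed histories."""
    smoothed = np.stack([smooth_history(run, size) for run in runs])
    return np.median(smoothed, axis=0), smoothed.min(axis=0), smoothed.max(axis=0)

def final_lifespan(history, size=51, last=500):
    """Average of the last `last` steps of the smoothed history."""
    return float(smooth_history(history, size)[-last:].mean())
```

Calling `summarize_runs` on an array of shape `(3, n_samples)` would give the three curves needed for a median line with a min-max band, and `final_lifespan` would give one scalar per run.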
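
The Figure 4 caption refers to the method of integrated gradients for measuring how sensitive $\hat{V}$ is to a viewport. The sketch below illustrates the general technique on a toy differentiable function and is not the paper's implementation. The all-zero baseline, the number of integration steps, and the stand-in `toy_value` function (with its hand-written gradient) are assumptions. For a real network, the gradient would come from an automatic differentiation library.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, n_steps=200):
    """Approximate integrated gradients of a scalar function at input x.

    grad_fn(z) must return the gradient of the function at z, with z's shape.
    The path integral from baseline to x uses a midpoint Riemann sum.
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    alphas = (np.arange(n_steps) + 0.5) / n_steps  # midpoints in (0, 1)
    avg_grad = np.zeros_like(x)
    for alpha in alphas:
        avg_grad += grad_fn(baseline + alpha * (x - baseline))
    avg_grad /= n_steps
    return (x - baseline) * avg_grad

# HYPOTHETICAL stand-in for a value estimate on a tiny 4x4 "viewport".
rng = np.random.default_rng(0)
w = 0.1 * rng.normal(size=(4, 4))

def toy_value(x):
    return np.tanh(np.sum(w * x))

def toy_value_grad(x):
    return (1.0 - np.tanh(np.sum(w * x)) ** 2) * w

viewport = rng.uniform(size=(4, 4))
baseline = np.zeros_like(viewport)  # ASSUMPTION: black image as the baseline
attributions = integrated_gradients(toy_value_grad, viewport, baseline)
```

A useful sanity check is the completeness property: the attributions should sum approximately to `toy_value(viewport) - toy_value(baseline)`.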
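
The 10-fold cross-validated $r^2$ in the Figure 5 caption can be illustrated with ordinary least squares. This is a sketch under assumptions, not the authors' analysis code. The caption does not say how folds are formed or whether $r^2$ is computed per fold, so the random shuffling, the per-fold scoring, and the function name `cross_validated_r2` are all hypothetical choices.

```python
import numpy as np

def cross_validated_r2(X, y, n_folds=10, seed=0):
    """Held-out r^2 of a linear fit (with intercept) for each of n_folds folds."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # a single regressor, e.g. satiety
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # ASSUMPTION: folds are random, roughly equal-sized partitions of the samples.
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        residual = y[test_idx] - A_test @ coef
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)
```

With `X` holding satiety, food countdown, or both as columns and `y` holding $\hat{V}$ per frame, the returned array gives one held-out score per fold, from which a median and range could be taken.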