Information-Seeking Decision Strategies Mitigate Risk in Dynamic, Uncertain Environments

Nicholas W. Barendregt, Joshua I. Gold, Krešimir Josić, Zachary P. Kilpatrick

TL;DR

The paper addresses how agents balance information gathering against direct reward pursuit in dynamic, uncertain environments. It develops a Bayesian, sequential two-alternative foraging framework to compare reward-maximizing (rewardmax) and information-maximizing (infomax) strategies using dynamic programming, with key parameters such as environmental change probability $\epsilon$, reward reliability $q$, and future discounting $\gamma$. Results show that while rewardmax attains higher average rewards, infomax delivers more robust and consistent reward distributions, with both strategies displaying similar phase transitions between exploration and exploitation as conditions shift. These findings highlight the adaptive value of information-seeking behavior in naturalistic settings and inform broader discussions of decision-making in dynamic, uncertain environments, including connections to POMDPs and future reward discounting.
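
As a rough illustration of the Bayesian belief tracking such a framework relies on, the sketch below updates the probability that the environment is in state $s_+$. It is a minimal sketch under stated assumptions, not the authors' implementation: observations are assumed to be binary and to point to the true state with probability $h$ (the Bernoulli-evidence parameter named in the Figure 2 caption), the state is assumed to switch to the other alternative with probability $\epsilon$, and the function names are hypothetical.

```python
def update_on_evidence(p, xi, h):
    """Posterior probability of s_+ after one binary observation.

    p  : prior probability that the state is s_+
    xi : observation, +1 (favors s_+) or -1 (favors s_-)
    h  : assumed probability that an observation matches the true state
    """
    like_plus = h if xi == 1 else 1 - h    # P(xi | s_+)
    like_minus = 1 - h if xi == 1 else h   # P(xi | s_-)
    return like_plus * p / (like_plus * p + like_minus * (1 - p))


def propagate_change(p, eps):
    """Belief in s_+ after the state switches with probability eps."""
    return (1 - eps) * p + eps * (1 - p)


# One observation favoring s_+ from a flat prior, followed by a possible switch.
belief = update_on_evidence(0.5, 1, 0.75)
belief = propagate_change(belief, 0.1)
print(belief)
```

With a flat prior and $h=0.75$, a single observation favoring $s_+$ moves the belief to 0.75, and a change probability of $\epsilon=0.1$ then pulls it back to about 0.7. The belief update that follows reward feedback (governed by $q$) is omitted here because its exact form is not given in this summary.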

Abstract

To survive in dynamic and uncertain environments, individuals must develop effective decision strategies that balance information gathering and decision commitment. Models of such strategies often prioritize either optimizing tangible payoffs, like reward rate, or gathering information to support a diversity of (possibly unknown) objectives. However, our understanding of the relative merits of these two approaches remains incomplete, in part because direct comparisons have been limited to idealized, static environments that lack the dynamic complexity of the real world. Here we compared the performance of normative reward- and information-seeking strategies in a dynamic foraging task. Both strategies show similar transitions between exploratory and exploitative behaviors as environmental uncertainty changes. However, we find subtle disparities in the actions they take, resulting in meaningful performance differences: whereas reward-seeking strategies generate slightly more reward on average, information-seeking strategies provide more consistent and predictable outcomes. Our findings support the adaptive value of information-seeking behaviors that can mitigate risk with minimal reward loss.

Paper Structure

This paper contains 12 sections, 25 equations, 6 figures.

Figures (6)

  • Figure 1: Dynamic Foraging Task and Modeling Approach. A: Schematic of an example realization of the dynamic foraging task. Between decisions, the environment is in a constant, hidden state $s^i$ that the agent infers from a sequence of observations, $\xi^i_{1:n}$. Once sufficiently confident, the agent makes decision $d^i$ and receives a probabilistic reward $r^i$ with reliability $q$. The environment then changes state with probability $\epsilon$, and the process begins again. B: Schematic of the dynamic programming approach for objective-maximizing behavior. At each time step the agent uses prior information to calculate the expected utility of sampling evidence and committing to a decision. The agent then takes the action that maximizes expected utility (either reward or information) and updates their belief based on the information gained from the environment (in the case of sampling) or from probabilistic reward (in the case of commitment).
  • Figure 2: Both Strategies Predict Explore-Exploit Phase Transitions. A,B: Example realization of the reward-maximizing strategy in state-likelihood space $p_n^i$ (A) and action space (B). In B, upward (downward) stems denote decisions towards $s_+$ ($s_-$), and green stems with circles (red stems with strikes) denote decisions that are rewarded (punished). Action numbers without stems denote actions for which the agent samples the environment for evidence. From each realization, we extract two behavioral metrics: 1) the average number of sequential commitments within a commit burst (e.g., purple), and 2) the average number of sequential samples within a sample burst (e.g., orange). In this particular realization, there are four commit bursts of lengths 2, 1, 2, and 1, and there are four sample bursts of length 1. C: Ratio of average sample burst length to average commit burst length for rewardmax behavior as a function of environmental stability ($\epsilon$) and reward reliability ($q$). In the realization shown in B, the average sample burst length is 1, and the average commit burst length is 1.5, resulting in a ratio metric of $\frac{2}{3}$. In the gray, dotted region of parameter space this ratio is unity. Values larger than unity indicate more exploratory behavior, whereas values smaller than unity indicate more exploitative behavior. Black, dashed line shows approximate location of the phase-transition boundary given by Eq. \ref{eq:phase_transition_approximation_parameter_regions}. As the environment becomes more stable ($\epsilon$ decreases) and provides more reliable feedback ($q$ increases), the agent opts to never sample, relying on pure exploitation that reliably delivers reward and evidence. Results generated with $10^4$ realizations, each with Bernoulli-evidence parameter $h=0.75$, time step budget $N=10$, action time step costs $(\tau_d,\tau_s)=(1,1)$, reward structure $(R_c,R_i)=(100,-100)$, and no discounting ($\gamma=1$). D: Same as C, but for infomax behavior. Infomax behavior exhibits a similar, but less sharp, phase transition to pure exploitation than rewardmax.
  • Figure 3: Action Alignment of Rewardmax and Infomax Behavior. A: Schematic of the alignment metric. For both models, we generate an action-space representation (using the same notation as in Fig. 2B) from a common environmental belief realization. We quantify the similarity of behaviors generated by the two models as "action alignment," defined as the proportion of identical actions (marked with stars) when presented with the same belief about the current environmental state. B: Alignment of rewardmax and infomax behaviors as a function of environmental stability ($\epsilon$) and reward reliability ($q$), progressing from no temporal discounting (left) to full temporal discounting of reward (right). Strategies are most distinct (i.e., alignment is lowest) in environments with intermediate stability (moderate $\epsilon$) and high reliability (large $q$). This difference becomes more pronounced as future utility is increasingly discounted ($\gamma\to0$). Results generated with $10^4$ realizations of common belief trajectories with $\gamma\in\{1,0.5,0\}$ (left-to-right panels) and all other task parameters as in Fig. 2.
  • Figure 4: Robustness of the Infomax Model. A: Average normalized reward rate differential $\Delta\rho=\frac{\rho_{\text{rm}}-\rho_{\text{im}}}{R_c}$ as a function of environmental stability, $\epsilon$, and reward reliability, $q$, progressing from no temporal discounting ($\gamma=1$, left) to full temporal discounting ($\gamma=0$, right) of reward. B: Schematic of reward rate distributions for both models. The rewardmax strategy tends to maximize the reward obtained, on average, but often with more variability than the infomax strategy across repeated instances of the same task conditions. Infomax thus guards against precipitously low reward returns by producing consistently adequate rewards. C: Robustness differential $\Delta\kappa=\kappa_{\text{rm}}-\kappa_{\text{im}}$, with a model's robustness defined as $\kappa=\frac{\langle\tilde{\rho}\rangle}{\text{std}(\tilde{\rho})}$. Even as rewardmax yields larger average reward rates, infomax delivers a narrower distribution of reward rates, making it a more robust strategy. This robustness advantage is enhanced as discounting on future utility increases.
  • Figure Sup-1: Rewardmax Phase Transition Behavior. A: Ratio of average sample burst length to average commit burst length for rewardmax behavior as in Fig. 2C, but with action time step costs $(\tau_c,\tau_s)=(2,1)$ (left) and $(\tau_c,\tau_s)=(1,2)$ (right). B: Same as A, but for environmental evidence Bernoulli parameter $h=0.55$ (left) and $h=0.95$ (right). C: Same as A, but for reward structure $(R_c,R_i)=(110,-100)$ (left) and $(R_c,R_i)=(100,-110)$ (right). D: Same as A, but for action budget $N=5$ (left) and $N=25$ (right). Unless specifically altered, all other task parameters are the same as in Fig. 2.
  • ...and 1 more figure
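
The behavioral metrics defined in the captions for Figures 2 to 4 (burst-length ratio, action alignment, and robustness $\kappa$) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors' code: the `"S"`/`"C"` action encoding, the function names, the conventions for sequences with no samples or no commitments, and the use of the sample standard deviation are all choices made here for illustration.

```python
import math
from itertools import groupby
from statistics import mean, stdev


def burst_ratio(actions):
    """Mean sample-burst length divided by mean commit-burst length.

    `actions` holds one label per action: "S" (sample) or "C" (commit).
    """
    # Collapse the sequence into (label, run length) pairs.
    runs = [(label, sum(1 for _ in group)) for label, group in groupby(actions)]
    sample = [n for label, n in runs if label == "S"]
    commit = [n for label, n in runs if label == "C"]
    if not sample:
        return 0.0       # sketch convention: the agent never samples
    if not commit:
        return math.inf  # sketch convention: the agent never commits
    return mean(sample) / mean(commit)


def action_alignment(actions_a, actions_b):
    """Proportion of positions at which two action sequences agree."""
    if len(actions_a) != len(actions_b) or not actions_a:
        raise ValueError("sequences must be non-empty and equally long")
    return sum(a == b for a, b in zip(actions_a, actions_b)) / len(actions_a)


def robustness(reward_rates):
    """kappa = mean / standard deviation of reward rates over realizations."""
    # Sample standard deviation is an assumption; the caption only says "std".
    return mean(reward_rates) / stdev(reward_rates)


# Realization from the Figure 2 caption: commit bursts of lengths 2, 1, 2, 1
# and four sample bursts of length 1 (the ordering used here is illustrative).
print(burst_ratio("CCSCSCCSCS"))
```

For the realization described in the Figure 2 caption, the mean sample burst length is 1 and the mean commit burst length is 1.5, so `burst_ratio` returns about 0.667, matching the stated ratio of $\frac{2}{3}$. The differential $\Delta\kappa$ in Figure 4C would then be `robustness(rates_rm) - robustness(rates_im)` for two hypothetical lists of per-realization reward rates.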