Sequential sampling without comparison to boundary through model-free reinforcement learning

Jamal Esmaily; Rani Moran; Yasser Roudi; Bahador Bahrami

TL;DR

The paper tackles how agents learn when to commit in perceptual decisions without relying on an evidence-accumulation boundary. It introduces a model-free Q-learning framework with an additional Wait action, so that sequential sampling is governed by experience rather than by a fixed threshold. The approach reproduces canonical psychometric and chronometric patterns, shows how payoff context shapes the speed-accuracy trade-off via the learned terminal state $B$, and aligns closely with reward-optimized solutions under many conditions. It also offers mechanisms to account for learning dynamics, observational learning, urgency, post-error slowing (PES), and volatility effects. This boundary-free RL perspective provides a unifying, testable account of decision making that can make use of discarded training data and adapt to changing contexts without requiring explicit boundary computations.

Abstract

Although the evidence-integration-to-boundary model has successfully explained a wide range of behavioral and neural data in decision making under uncertainty, how animals learn and optimize the boundary remains unresolved. Here, we propose a model-free reinforcement learning algorithm for perceptual decisions under uncertainty that dispenses entirely with the concepts of decision boundary and evidence accumulation. Our model learns whether to commit to a decision given the available evidence or to continue sampling information at a cost. We reproduce the canonical features of perceptual decision-making, such as the dependence of accuracy and reaction time on evidence strength, the modulation of the speed-accuracy trade-off by payoff regime, and many others. By unifying learning and decision making within the same framework, this model can account for unstable behavior during training as well as stabilized post-training behavior, opening the door to revisiting the extensive volumes of discarded training data in the decision science literature.
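The commit-or-wait learning rule described in the abstract can be illustrated with a small tabular Q-learning sketch. This is not the authors' implementation. The zero-initialized Q-table, the three actions, $R_{\rm{correct}} = 20$, the Wait reward of $-1$, and the 2400 training trials follow the figure captions; everything else is an assumption made for illustration: Gaussian evidence with mean `gain * coherence`, integer states clipped to $[-M, M]$, the exploration probability `explore`, the value of `r_wrong`, undiscounted bootstrapping, the `max_steps` cap, and a coherence set restricted to the values the Figure 2 caption spells out.

```python
import numpy as np

LEFT, RIGHT, WAIT = 0, 1, 2  # column indices of the Q-table


def run_trial(Q, coherence, rng, lr=0.1, explore=0.1, r_correct=20.0,
              r_wrong=-20.0, r_wait=-1.0, gain=2.0, max_steps=200, learn=True):
    """Simulate one trial; return (choice, n_waits, correct)."""
    M = (Q.shape[0] - 1) // 2            # states are the integers -M..M
    # At zero coherence there is no true direction, so one is drawn at random.
    if coherence > 0:
        target = RIGHT
    elif coherence < 0:
        target = LEFT
    else:
        target = int(rng.integers(2))
    state = 0
    for n_waits in range(max_steps):
        row = state + M
        if learn and rng.random() < explore:
            action = int(rng.integers(3))                  # exploratory action
        else:
            best = np.flatnonzero(Q[row] == Q[row].max())
            action = int(rng.choice(best))                 # greedy, random ties
        if action != WAIT:                                 # commit: trial ends
            reward = r_correct if action == target else r_wrong
            if learn:
                Q[row, action] += lr * (reward - Q[row, action])
            return action, n_waits, action == target
        # Wait: pay the sampling cost, draw evidence, move to the next state.
        evidence = rng.normal(gain * coherence, 1.0)       # assumed noise model
        state_next = int(np.clip(np.rint(state + evidence), -M, M))
        if learn:
            bootstrap = r_wait + Q[state_next + M].max()   # undiscounted target
            Q[row, WAIT] += lr * (bootstrap - Q[row, WAIT])
        state = state_next
    return None, max_steps, False                          # never committed


def train(n_trials=2400, M=30, seed=0,
          coherences=(-0.512, -0.256, 0.0, 0.256, 0.512)):
    """Train a zero-initialized Q-table on randomly chosen coherences."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((2 * M + 1, 3))
    for _ in range(n_trials):
        run_trial(Q, rng.choice(coherences), rng)
    return Q, rng
```

After `Q, rng = train()`, calling `run_trial(Q, 0.256, rng, learn=False)` runs a test trial with the Q-table left unchanged. No threshold appears anywhere in the loop: the state at which a trial ends is simply the first state where a terminating action has the largest Q-value or is chosen by exploration.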

Paper Structure (23 sections, 9 equations, 21 figures)

Introduction
Results
The setup
Evolution of the Q-table during learning
Model's behavior in perceptual decision making
Decision dynamics during learning
The impact of payoff structure on Speed-Accuracy Trade-off
Comparison with Optimizing Expected Reward
Deciding without evidence accumulation
Discussion
A toy example scenario
Effect of the Learning Rate $\epsilon$ and number of states $M$
SAT During Learning: Observational Learning and Time Dependent Waiting
Observational learning
Time Dependent Waiting
...and 8 more sections

Figures (21)

Figure 1: Schematic illustration of the perceptual stimulus, trial structure, and model components. The structure of the task and three consecutive trials ($u-1$, $u$, $u+1$) are illustrated with time progressing from left to right (horizontal arrow), both within each trial and from one trial to the next. The variable widths of the white gaps between trials depict the random duration of the inter-trial intervals. We assume that no update happens during these periods. In each trial, a random dot motion stimulus moves towards the left or right. The evidence ($E_t$ in Eq. \ref{eqn:state}) is sampled every time the agent chooses to wait (i.e., the Wait action), and the state variable is updated by accumulating the evidence. This within-trial updating continues until the agent chooses one of the terminating actions (L, R), at which point the state and evidence variables are set to zero and remain zero until the motion stimulus of the next trial appears. The states at which these terminating actions are taken are the terminal states, indicated by the red circles in the plots labeled States. At each time point, the agent receives a reward based on the action it has taken. Unlike in sequential sampling models, the model neither formalizes nor performs any explicit comparison to a threshold.
Figure 2: Evolution of the Q-table. (a) Snapshots of the Q-values of each action at each state, shown at the beginning of learning (trial number $=0$), where all Q-values are set to zero. The Q-values shown are averaged over 30 simulations with the same parameters. In each trial $u$, the coherence level $c^{u}$ is chosen randomly and with equal probability from the set $\mathcal{C} = [-51.2\%, -25.6\%, \ldots, 0, \ldots, 25.6\%, 51.2\%]$. As training proceeds, the Q-values associated with the Wait action (green) in the states around zero stay higher, but those for the terminating actions (Left and Right) drop to lower values. The Q-value for each terminating action exceeds that of Wait on the side corresponding to the correct choice (i.e., red on the left and blue on the right; see arrows). (b) Terminal states initially emerge near zero and then, with training, move away from it in the rightward (blue) and leftward (red) directions. Each thin line shows the result of one of the 30 simulations, and the thick line represents the average over those simulations. (c) Histograms showing the fraction of times that a state had the largest Q-value (when averaged over 30 simulations) during 4 different periods of learning, each comprising $600$ trials. As training progresses, the histograms shift away from zero and become narrower; the solid curves are fitted to the histograms. In the simulations reported in this figure, $U=2400$ trials were used.
Figure 3: Model performance after training. The psychometric curve showing choice Accuracy (a) and the chronometric curve showing Reaction Time (RT) (b), both plotted as a function of the coherence level, $c$. Data from the simulations are denoted by black points, and the lines in (a) and (b) show Eqs. \ref{eqn:accan} and \ref{eqn:rtan}, respectively. To plot these lines, we fixed $B$ in Eqs. \ref{eqn:accan} and \ref{eqn:rtan} to the average of the model's terminal state over $2400$ test trials during which the Q-table was left unchanged. The error bars show SEM over these trials. (c) Two examples of the states visited by the model as time progresses through test trials in which the Q-table is fixed, for two stimuli with opposing directions. (d) Q-values of the trained model for different actions in different states. The bumps appearing at states $\sim 20$ and $\sim -20$ indicate the location of the terminal states. All other parameters are the same as in Fig. \ref{fig:lr_q}.
Figure 4: Changes in decision Accuracy and RT during training. (a) Accuracy increases as training progresses. Light grey curves show this for 30 individual simulations with the same model parameters, smoothed through convolution with a unity array of size 50. The solid black line shows the average over these simulations. (b) Changes in psychometric threshold (black) and lapse rate (red) during training. The psychometric threshold is defined as the coherence level at which the model performs at 82$\%$ accuracy (e.g., $\alpha$ in the Weibull CDF \cite{roitman_response_2002, law_neural_2008}), and the lapse rate is the error rate in trials with 100$\%$ coherence. Thick curves are fits to the data points using functions similar to those used in \cite{law_neural_2008}. The inset shows empirical data from macaque monkeys \cite{law_neural_2008}. (c) Same as (a) but for the reaction times.
Figure 5: Speed-accuracy trade-off (SAT) of the trained model. (a) Choice Accuracy and (b) RT for different values of CBR: Large (4), Intermediate (3), and Small (2), indicated by different colors. (c) The terminal state for different values of CBR and different learning rates $\epsilon$. Increasing CBR pushes the terminal state further away from zero, producing the dependence shown in (a)-(b). Curves are smoothed using a moving average filter and $U = 900$. CBR values were changed from $0.01$ to $10^{5}$ in equal logarithmic steps, by fixing $R_{\rm{correct}} = 20$ and changing $R_{\rm{wrong}}$. Choice Accuracy (d) and RT (e) versus coherence level when the cost of the Wait action is changed while CBR is kept constant. RT values are smaller and Accuracy is lower when the cost of the Wait action is larger, as can be seen by comparing the black and gray curves, corresponding to $R_{\rm{wait}} = -2$ and $R_{\rm{wait}} = -1$, respectively. Error bars are SEMs across trials; the simulation involved 1200 trials.
...and 16 more figures
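Two bookkeeping steps mentioned in the captions of Figures 4 and 5 can be sketched as follows: smoothing a per-trial outcome series by convolution with a length-50 array of ones, and generating CBR values in equal logarithmic steps between $0.01$ and $10^{5}$. The function names, the division of the kernel by its length (so that the smoothed curve stays on the accuracy scale), the `"valid"` edge handling, and the number of grid points are assumptions; the captions do not state them, and the mapping from CBR to $R_{\rm{wrong}}$ is not given here, so it is left out.

```python
import numpy as np


def smooth_outcomes(outcomes, window=50):
    """Moving average via convolution with a length-`window` array of ones."""
    kernel = np.ones(window) / window        # normalization is an assumption
    series = np.asarray(outcomes, dtype=float)
    # "valid" keeps only positions where the window fits entirely (assumed).
    return np.convolve(series, kernel, mode="valid")


def cbr_grid(n_values=8, low=0.01, high=1e5):
    """CBR values from `low` to `high` in equal logarithmic steps."""
    return np.logspace(np.log10(low), np.log10(high), n_values)
```

Applied to a 0/1 array of per-trial correctness, `smooth_outcomes` yields a running accuracy estimate of the kind plotted for the individual simulations in Figure 4(a).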
