RTify: Aligning Deep Neural Networks with Human Behavioral Decisions

Yu-Ang Cheng, Ivan Felipe Rodriguez, Sixuan Chen, Kohitij Kar, Takeo Watanabe, Thomas Serre

TL;DR

The paper addresses the mismatch between static accuracy metrics in vision models and the dynamic, time-resolved nature of human decisions. It introduces RTify, a differentiable framework that learns to align recurrent network dynamics to human reaction times by mapping hidden states to evidence and accumulating until a threshold is reached. The approach supports both supervised training on human RTs and self-penalized, ideal-observer optimization, and includes a differentiable, multi-class Wong-Wang module that can plug into CNNs. Across random dot motion and natural-image categorization tasks, RTify achieves superior fits to human RT distributions and reveals that human-like speed-accuracy trade-offs can emerge from self-penalized optimization, offering a pathway toward integrated, human-aligned vision-decision models.
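The accumulate-to-threshold idea described above can be sketched in a few lines. This is a minimal illustration of the forward computation only, not the authors' implementation: the linear readout `evidence`, the toy hidden states, and the helper name `reaction_time` are all assumptions made for this example, and the hard threshold test used here is not the differentiable approximation the paper relies on for training.

```python
from itertools import accumulate


def evidence(hidden_state, weights, bias=0.0):
    """Hypothetical linear readout standing in for the trainable mapping
    from an RNN hidden state to a scalar evidence value."""
    return sum(w * h for w, h in zip(weights, hidden_state)) + bias


def reaction_time(hidden_states, weights, theta):
    """Return the first 1-indexed time step at which the running sum of
    evidence exceeds theta, or None if the threshold is never crossed."""
    per_step = (evidence(h, weights) for h in hidden_states)
    for t, phi in enumerate(accumulate(per_step), start=1):
        if phi > theta:
            return t
    return None


# Toy hidden states (made up for illustration), one list per time step.
hidden_states = [[0.1, 0.0], [0.2, 0.1], [0.4, 0.3], [0.5, 0.5]]
weights = [1.0, 1.0]

rt = reaction_time(hidden_states, weights, theta=1.0)
rt_unreached = reaction_time(hidden_states, weights, theta=10.0)
```

Lowering `theta` makes the sketch respond earlier on less evidence, and raising it makes the response later; this is the knob that a speed-accuracy trade-off acts on.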

Abstract

Current neural network models of primate vision focus on replicating overall levels of behavioral accuracy, often neglecting perceptual decisions' rich, dynamic nature. Here, we introduce a novel computational framework to model the dynamics of human behavioral choices by learning to align the temporal dynamics of a recurrent neural network (RNN) to human reaction times (RTs). We describe an approximation that allows us to constrain the number of time steps an RNN takes to solve a task with human RTs. The approach is extensively evaluated against various psychophysics experiments. We also show that the approximation can be used to optimize an "ideal-observer" RNN model to achieve an optimal tradeoff between speed and accuracy without human data. The resulting model is found to account well for human RT data. Finally, we use the approximation to train a deep learning implementation of the popular Wong-Wang decision-making model. The model is integrated with a convolutional neural network (CNN) model of visual processing and evaluated using both artificial and natural image stimuli. Overall, we present a novel framework that helps align current vision models with human behavior, bringing us closer to an integrated model of human vision.

Paper Structure

This paper contains 18 sections, 1 theorem, 15 equations, 11 figures.

Key Result

Proposition 1

Let us define $\tau_{\theta}^*(\Phi_{t}) = \min\{t \in \mathbb{R}_{[1, N]} : \Phi_{t} > \theta\}$ as the time at which $\Phi_{t}$ first exceeds the activity threshold $\theta$. Provided that $\Phi_{t}$ is continuously differentiable:

Figures (11)

  • Figure 1: Illustration of our RTify method. The input is a visual stimulus represented by random moving dots, but the model can also accommodate color images and video sequences. We take a pretrained task-optimized RNN and use a trainable function $f_w$ to transform the activity of the network into a real-valued evidence measure, $e_{t}$, that will be integrated over time by an evidence accumulator, $\Phi_{t}$. When the evidence accumulator reaches the threshold $\theta$, processing stops, and a decision is taken. The time step at which the accumulated evidence passes this threshold $\tau_{\theta}$ is taken as the model RT for this stimulus.
  • Figure 2: RTified model evaluation on an RDM task [green2010]. Human data are shown as a gray shaded area, and model fits are shown for (A) the "supervised" setting where human behavioral responses are used to train the models and (B) the "self-penalized" setting where no human data is used. Our approach (green) outperforms the two alternative approaches (brown), i.e., entropy-thresholding [spoerer2020] for the "supervised" and uncertainty proxy [Goetschalckx] for the "self-penalized" settings (see Fig. 4 for MSE comparisons and Fig. S3 for all coherences).
  • Figure 3: Illustration of RTifying feedforward neural networks. We develop a multi-class compatible and fully differentiable RNN module based on the WW model [wang2002, Wong2006-xa]. This module is implemented as an attractor-based RNN, and is stacked on top of a feedforward neural network. The feedforward neural network first takes an image as the input. Outputs from classification units of the network are then sent to RTified WW (A). Information is accumulated by multiple populations of neurons in RTified WW while they compete with each other (B). A decision is made and the process stops when one of the populations reaches a threshold. The number of time steps needed for the RTified WW to reach the threshold is used to predict human RT (C).
  • Figure 4: (A) MSE comparisons for the RDM task [green2010] for all coherence levels. The RTified model trained in the "supervised" setting (i.e., with human behavioral responses; green solid line) performs better (lower MSE) than entropy-thresholding [spoerer2020] (brown solid line) at all coherence levels. Similarly, the RTified model trained in the "self-penalized" setting (i.e., without human data; green dashed line) performs better than uncertainty proxy [Goetschalckx] (brown dashed line). With the help of our RTified WW module (orange solid line), a convolutional neural network (C3D) can also fit the data better than entropy-thresholding [spoerer2020]. (B) Classification accuracy comparisons between pretrained and RTified models for the RDM task [green2010]. The RTified models trained with human RT data in the "supervised" setting (green solid line) and without human data in the "self-penalized" setting (green dashed line) achieve more human-like classification accuracy at all coherence levels than the pretrained model without RTify (green dotted line). With the help of our RTified WW module (orange solid line), a CNN (C3D) matches human accuracy better than the pretrained model without RTify (orange dotted line).
  • Figure 5: RTified model evaluation on an object categorization task [kar2019evidence]. Model vs. human RT predictions for our RTified model (green) vs. alternative approaches (brown) (A) in the "supervised" setting where human behavioral responses are used to train the model and (B) the "self-penalized" setting where no human data is used. Solid lines are linear regression fits between model and human RTs. Crossed-shaded areas and the dashed lines are controls to show the fits after removing the highest model RTs. Our approach outperforms the two alternative approaches, i.e., entropy-thresholding [spoerer2020] for the "supervised" setting and uncertainty proxy [Goetschalckx] for the "self-penalized" setting.
  • ...and 6 more figures
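
The competition-until-threshold process described in the Figure 3 caption can be illustrated with a toy race between populations. This is a deliberately simplified stand-in and not the Wong-Wang equations or the authors' RTified WW module: the update rule, the `leak` and `inhibition` parameters, and the function name `race_to_threshold` are assumptions chosen only to show several accumulators competing until one crosses a threshold.

```python
def race_to_threshold(inputs, theta, leak=0.1, inhibition=0.2, max_steps=100):
    """Toy multi-population race (hypothetical update rule).

    Each population integrates its own input, decays through a leak term,
    and is suppressed by the summed activity of the other populations.
    Returns (winning index, 1-indexed step count), or (None, max_steps)
    if no population crosses theta within max_steps.
    """
    state = [0.0] * len(inputs)
    for t in range(1, max_steps + 1):
        total = sum(state)
        state = [
            max(0.0, s + inp - leak * s - inhibition * (total - s))
            for s, inp in zip(state, inputs)
        ]
        winner = max(range(len(state)), key=state.__getitem__)
        if state[winner] > theta:
            return winner, t
    return None, max_steps


# Inputs play the role of classifier outputs for three classes (made-up values).
weak_choice, weak_steps = race_to_threshold([0.3, 0.1, 0.1], theta=1.0)
strong_choice, strong_steps = race_to_threshold([0.6, 0.1, 0.1], theta=1.0)
```

In this sketch a stronger input for the winning class crosses the threshold in fewer steps, which mirrors the qualitative idea that the number of steps to threshold serves as the model's predicted RT.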

Theorems & Definitions (2)

  • Proposition 1
  • proof