
SpiderNets: Vision Models Predict Human Fear From Aversive Images

Dominik Pegler, David Steyrl, Mengfan Zhang, Alexander Karner, Jozsef Arato, Frank Scharnowski, Filip Melinscak

TL;DR

This work demonstrates that pretrained CNN and vision-transformer architectures, adapted via transfer learning, can predict group-level, image-evoked fear ratings for spider-related stimuli on a 0–100 scale ($MAE$ ~9.8–11.0, $R^2$ ~0.61–0.66). Using a strict dual-level cross-validation scheme and gradient-based explainability, the authors show that predictions are grounded in spider-related visual cues. Transformer models prove data efficient, and ensembling improves performance further, to $MAE$ ≈ $9.13$. The study also provides a quantitative error analysis and interpretable visualizations (Grad-CAM and feature visualization) to identify conditions under which predictions falter, such as extreme fear levels or visually ambiguous scenes. Overall, the findings establish a transparent, data-driven approach to estimating image-evoked fear that could underpin adaptive digital mental health tools, including exposure therapies and VR-based interventions, while highlighting practical limitations and directions for personalization and safety safeguards.
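As a rough illustration of what a dual-level hold-out can look like, the sketch below keeps a set of participants and a set of images out of training, so that test ratings come only from unseen people rating unseen images. This summary does not give the authors' exact procedure, so the function names (`dual_level_split`, `group_mean_fear`), the `(participant_id, image_id, fear)` tuple layout, and the 25% hold-out fraction are all assumptions made for the example.

```python
import random
from collections import defaultdict


def dual_level_split(ratings, test_frac=0.25, seed=0):
    """Split ratings so test data has unseen participants AND unseen images.

    ratings: list of (participant_id, image_id, fear) tuples (assumed layout).
    """
    rng = random.Random(seed)
    participants = sorted({p for p, _, _ in ratings})
    images = sorted({i for _, i, _ in ratings})
    rng.shuffle(participants)
    rng.shuffle(images)

    # Hold out a fraction of participants and a fraction of images.
    n_p = max(1, int(len(participants) * test_frac))
    n_i = max(1, int(len(images) * test_frac))
    test_p, test_i = set(participants[:n_p]), set(images[:n_i])

    # Train: neither the participant nor the image is held out.
    train = [r for r in ratings if r[0] not in test_p and r[1] not in test_i]
    # Test: both the participant and the image are held out.
    test = [r for r in ratings if r[0] in test_p and r[1] in test_i]
    return train, test


def group_mean_fear(ratings):
    """Average fear per image across participants (the group-level target)."""
    sums = defaultdict(lambda: [0.0, 0])
    for _, image_id, fear in ratings:
        sums[image_id][0] += fear
        sums[image_id][1] += 1
    return {image_id: total / n for image_id, (total, n) in sums.items()}


# Toy data (illustrative only, not from the study): 8 participants x 12 images.
toy = [(p, i, float((p * 7 + i * 13) % 101)) for p in range(8) for i in range(12)]
train, test = dual_level_split(toy)
train_targets = group_mean_fear(train)
test_targets = group_mean_fear(test)
```

Ratings in which only the participant or only the image is held out are discarded in this sketch. That costs data, but it keeps the test estimate free of leakage at both levels.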

Abstract

Phobias are common and impairing, and exposure therapy, which involves confronting patients with fear-provoking visual stimuli, is the most effective treatment. Scalable computerized exposure therapy requires automated prediction of fear directly from image content to adapt stimulus selection and treatment intensity. Whether such predictions can be made reliably and generalize across individuals and stimuli, however, remains unknown. Here we show that pretrained convolutional and transformer vision models, adapted via transfer learning, accurately predict group-level perceived fear for spider-related images, even when evaluated on new people and new images, achieving a mean absolute error (MAE) below 10 units on the 0-100 fear scale. Visual explanation analyses indicate that predictions are driven by spider-specific regions in the images. Learning-curve analyses show that transformer models are data efficient and approach performance saturation with the available data (~300 images). Prediction errors increase for very low and very high fear levels and within specific categories of images. These results establish transparent, data-driven fear estimation from images, laying the groundwork for adaptive digital mental health tools.
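To make the reported metrics concrete, here is a minimal sketch of how MAE and $R^2$ can be computed for per-image fear predictions on the 0–100 scale, along with a simple prediction-averaging ensemble. The numbers are toy values invented for illustration, not data from the study, and plain averaging is an assumption: this summary does not say how the authors combined models.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the fear scale."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def ensemble_mean(predictions):
    """Average the per-image predictions of several models (assumed scheme)."""
    return [sum(col) / len(col) for col in zip(*predictions)]


# Toy group-level fear ratings and two hypothetical models' predictions.
y_true = [10.0, 40.0, 70.0, 90.0]
model_a = [20.0, 50.0, 60.0, 80.0]
model_b = [10.0, 30.0, 80.0, 100.0]

combined = ensemble_mean([model_a, model_b])
print(mae(y_true, model_a), mae(y_true, model_b), mae(y_true, combined))
print(r2(y_true, model_a))
```

In this toy case the two models err in opposite directions on several images, so averaging cancels part of the error. That is the general reason an ensemble can score below its members, although the size of the gain depends on how correlated the models' errors are.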

Paper Structure

This paper contains 54 sections, 3 equations, 20 figures, 6 tables.

Figures (20)

  • Figure 1: Transfer Learning from a Pretrained Base Model to a Task-Adapted Model
  • Figure 2: Nested Cross-Validation with Random Hyperparameter Search
  • Figure 3: Predictive Performance Overview
  • Figure 4: Explainability Results: Grad-CAM and Feature Visualizations
  • Figure 5: Absolute Error vs. Fear, and Highest-Error Images
  • ...and 15 more figures