L-WISE: Boosting Human Visual Category Learning Through Model-Based Image Selection and Enhancement

Morgan B. Talbot, Gabriel Kreiman, James J. DiCarlo, Guy Gaziv

TL;DR

The paper uses robustified ANNs to enhance human visual category learning in two ways: predicting how difficult each image is to recognize, and generating image perturbations that enhance category-specific features. It introduces L-WISE, a curriculum design that samples training images by model-predicted difficulty and applies perturbations that maximize the ground-truth logit, guided by robust ResNet-50 (and XCiT) models. Across moth species, dermoscopy, and histology tasks, L-WISE improves test-time accuracy by 33-72% relative to controls and shortens training time by 20-23%, with contributions from both difficulty-based selection and image enhancement. The work demonstrates a concrete, model-aligned approach to human perceptual learning, is supported by reproducible code and methodology, and prompts careful consideration of biases and safety in real-world educational and clinical applications.

Abstract

The currently leading artificial neural network models of the visual ventral stream - which are derived from a combination of performance optimization and robustification methods - have demonstrated a remarkable degree of behavioral alignment with humans on visual categorization tasks. We show that image perturbations generated by these models can enhance the ability of humans to accurately report the ground truth class. Furthermore, we find that the same models can also be used out-of-the-box to predict the proportion of correct human responses to individual images, providing a simple, human-aligned estimator of the relative difficulty of each image. Motivated by these observations, we propose to augment visual learning in humans in a way that improves human categorization accuracy at test time. Our learning augmentation approach consists of (i) selecting images based on their model-estimated recognition difficulty, and (ii) applying image perturbations that aid recognition for novice learners. We find that combining these model-based strategies leads to categorization accuracy gains of 33-72% relative to control subjects without these interventions, on unmodified, randomly selected held-out test images. Beyond the accuracy gain, the training time for the augmented learning group was also shortened by 20-23%, despite both groups completing the same number of training trials. We demonstrate the efficacy of our approach in a fine-grained categorization task with natural images, as well as two tasks in clinically relevant image domains - histology and dermoscopy - where visual learning is notoriously challenging. To the best of our knowledge, our work is the first application of artificial neural networks to increase visual learning performance in humans by enhancing category-specific image features.
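
The two components named in the abstract, difficulty-based image selection and a fading enhancement strength, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the number of schedule steps, and the percentile values are hypothetical, and it assumes that a higher ground-truth logit means an easier image. Only the overall shape (difficulty rising step-wise, the pixel budget $\epsilon$ falling step-wise) follows the description in the text.

```python
import random


def step_schedule(trial, n_trials, start, end, n_steps):
    """Piecewise-constant schedule that moves from `start` to `end`
    in `n_steps` equal-length blocks of trials (hypothetical shape)."""
    if n_steps == 1:
        return start
    block = min(trial * n_steps // n_trials, n_steps - 1)
    return start + (end - start) * block / (n_steps - 1)


def select_image(gt_logits, max_difficulty_pct, rng):
    """Sample one image id whose difficulty percentile is at most
    `max_difficulty_pct`.

    gt_logits maps image id -> ground-truth logit from the model.
    Assumption: a higher logit means an easier image.
    """
    easiest_first = sorted(gt_logits, key=lambda k: gt_logits[k], reverse=True)
    n_allowed = max(1, int(len(easiest_first) * max_difficulty_pct / 100))
    return rng.choice(easiest_first[:n_allowed])


def plan_trial(trial, n_trials, gt_logits, rng):
    """Return (image id, enhancement budget eps) for one training trial."""
    # Difficulty cap rises from the 50th to the 100th percentile (made-up values).
    max_pct = step_schedule(trial, n_trials, 50.0, 100.0, n_steps=5)
    # Enhancement budget falls from eps = 8 to eps = 0 (no enhancement).
    eps = step_schedule(trial, n_trials, 8.0, 0.0, n_steps=5)
    return select_image(gt_logits, max_pct, rng), eps
```

Under these assumptions, early trials draw only from the easier half of the image pool and apply the strongest enhancement, while the final block of trials draws from the whole pool with unmodified images, which matches the conditions of the test phase.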

Paper Structure

This paper contains 27 sections, 4 equations, 20 figures, 3 tables.

Figures (20)

  • Figure 1: Robustified ANNs can be used out-of-the-box as image recognition difficulty estimators and ground truth percept enhancers. We consider a 16-way basic animal classification task. Panel A1 shows the correspondence between human categorization accuracy and model-computed ground truth logit activation values. The curve denotes a logistic regression model predicting the probability of a correct response using only the logit value ($p < 0.001$ from the Wald statistic, $\text{AUC}=0.72$ under 10-fold cross validation). A2 shows example images with varying ground truth logit values (predicted difficulty). B1 shows how perturbing images via ground truth logit maximization increases human recognition accuracy progressively with the $\ell_{2}$-norm perturbation pixel budget $\epsilon$. Other off-the-shelf image enhancement methods do not increase categorization accuracy, despite inducing larger perturbations of $\epsilon=43$, $\epsilon=106$, and $\epsilon=26$ on average from left to right. B2 shows example images: unmodified (left), enhanced by ground truth logit maximization with pixel budgets ${\epsilon}=10$ and ${\epsilon}=20$ (middle), and enhanced by baseline off-the-shelf methods (to the right of the dotted line). All vertical error bars are 95% confidence intervals by bootstrap. Horizontal error bars in panel A1 show the standard deviation among images within each logit value bin.
  • Figure 2: Robustified ANNs can be used to boost image category learning in humans. A novice human learner undertakes a challenging image categorization task, which consists of a training phase (B) and a test phase (C). Images for both phases are randomly drawn from a labeled image dataset of unfamiliar fine-grained categories (A). Feedback (correct/incorrect, with indication of the correct category) is delivered after each trial during the training phase only. Our proposed "Logit-Weighted Image Selection and Enhancement" (L-WISE) approach uses an ANN model (D) to augment the visual curriculum by using the difficulty score to sample images based on a predefined increasing schedule of maximal difficulty per trial (E), and by enhancing images for easier recognition with an enhancement magnitude that decreases along a predefined schedule (F).
  • Figure 3: Novice learners who had their curriculum augmented by our method showed improved test-time categorization accuracy for previously unfamiliar categories. This figure shows empirical results from a 4-way fine-grained moth species classification task. Panel A shows examples of the 4 moth classes, side-by-side with their model-enhanced versions at the highest pixel budget used in our experiments ($\epsilon=8$). While subtle, one notable difference is the distinctive wing spots of moth class 2, which are enlarged in the enhanced version of the image. Also included are difference images showing the (5x magnified) difference between original and enhanced images, and heat maps with more red coloration in regions of larger changes from enhancement. B compares the average smoothed accuracy of participants in the L-WISE group and a control group. Shaded areas denote the standard error of the mean. The test accuracy gain of the L-WISE group relative to the control group is statistically significant ($\chi^2(1)$ test, $p < 0.001$). C, D show the trial-dependent empirical profiles of the average image difficulty percentile of selected images, which (noisily) increases step-wise, and the perturbation pixel budget for enhancement ($\epsilon$), which decreases step-wise. These profiles are uniform in the control group (black dotted lines), denoting randomly-chosen non-enhanced images.
  • Figure 4: Our approach can boost time efficiency and final accuracy of image category learning for humans across varied image domains, including in clinically relevant tasks. Panel A compares the mean test-phase accuracy and training-phase duration of human participants who were randomized to L-WISE or control groups and learned a moth photo, dermoscopy, or histology classification task. All differences between L-WISE and the control group are statistically significant ($\chi^2(1)$ test, $p < 0.05$). Panel B shows precision and recall in L-WISE and control groups, with each point representing a specific class in one of the three tasks. All error bars show 95% bootstrap confidence intervals. Each class from the dermoscopy and histology tasks is illustrated in panels C and D respectively, similarly to the moth classes in Fig. 3A.
  • Figure S1: Ground truth logit enhancement with robustified ANNs leads to semantically meaningful perturbations. The top row shows original ImageNet images, and the second row shows the same images after enhancement by robustified ResNet-50 (training $\epsilon=3$) with a pixel budget of $\epsilon=20$. The third row shows a 5x magnified version of the difference between the enhanced image and the original, and the bottom row shows a heat map where red regions correspond to larger changes and blue regions correspond to smaller changes.
  • ...and 15 more figures
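
The captions for Figures 1 and S1 describe enhancement as ground truth logit maximization under an $\ell_{2}$-norm pixel budget $\epsilon$. One common way to realize such a constraint is projected gradient ascent, sketched below in plain Python. This is an illustration under stated assumptions, not the paper's procedure: the "model" is a toy linear logit standing in for a robustified network, `enhance`, `project_l2`, `toy_logit`, and `toy_grad` are hypothetical names, and clipping to the valid pixel range is omitted.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def project_l2(delta, eps):
    """Scale `delta` back onto the L2 ball of radius `eps` if it lies outside."""
    norm = l2_norm(delta)
    if norm <= eps or norm == 0.0:
        return list(delta)
    return [d * eps / norm for d in delta]


def enhance(image, grad_fn, eps, step_size, n_steps):
    """Projected gradient ascent on the ground-truth logit.

    grad_fn(x) must return the gradient of the ground-truth logit
    with respect to the (flattened) image x.
    """
    delta = [0.0] * len(image)
    for _ in range(n_steps):
        x = [p + d for p, d in zip(image, delta)]
        grad = grad_fn(x)
        grad_norm = l2_norm(grad)
        if grad_norm == 0.0:
            break
        # Normalized ascent step, then projection back into the budget.
        delta = [d + step_size * g / grad_norm for d, g in zip(delta, grad)]
        delta = project_l2(delta, eps)
    return [p + d for p, d in zip(image, delta)]


# Toy stand-in for a network: a linear ground-truth logit w . x (assumption).
W_TRUE = [0.5, -1.0, 2.0]


def toy_logit(x):
    return sum(w * xi for w, xi in zip(W_TRUE, x))


def toy_grad(x):
    return list(W_TRUE)  # gradient of a linear logit is constant
```

With this toy model, `enhance([0.0, 0.0, 0.0], toy_grad, eps=2.0, step_size=0.5, n_steps=10)` returns an image with a higher ground-truth logit whose total change stays within the budget of 2.0 in $\ell_{2}$ norm. With a real network the gradient would come from automatic differentiation instead of `toy_grad`.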