MENTOR: Human Perception-Guided Pretraining for Increased Generalization

Colton R. Crum, Adam Czajka

TL;DR

MENTOR tackles open-set anomaly detection by first learning human-perception–driven saliency through an autoencoder, then fine-tuning a classifier on top of the encoder with standard supervision. This decoupled, two-stage pretraining improves generalization over ImageNet initialization and existing perception-guided losses across iris PAD, synthetic faces, and chest X-ray diagnosis, while remaining architecture-agnostic. The method also proves compatible with CYBORG and UNET+Gaze, enhancing their performance without structural changes. Collectively, MENTOR offers a data-efficient, versatile approach to embed human perceptual priors into CNNs for robust cross-domain anomaly detection.

Abstract

Incorporating human perception into the training of convolutional neural networks (CNNs) has boosted the generalization capabilities of such models in open-set recognition tasks. One active research question is where (in the model architecture or training pipeline) and how to efficiently incorporate the always-limited human perceptual data into model training strategies. In this paper, we introduce MENTOR (huMan pErceptioN-guided preTraining fOr increased geneRalization), which addresses this question through two distinct rounds of training CNNs tasked with open-set anomaly detection. First, we train an autoencoder to learn human saliency maps given an input image, without any class labels. The autoencoder is thus tasked with discovering domain-specific salient features that mimic human perception. Second, we remove the decoder, add a classification layer on top of the encoder, and train this new model conventionally, now using class labels. We show that MENTOR raises generalization performance across three different CNN backbones in a variety of anomaly detection tasks (demonstrated for detection of unknown iris presentation attacks, synthetically generated faces, and anomalies in chest X-ray images) compared to traditional pretraining methods (e.g., sourcing the weights from ImageNet), as well as state-of-the-art methods that incorporate human perception guidance into training. In addition, we demonstrate that MENTOR can be flexibly applied to existing human perception-guided methods, increasing their generalization with no architectural modifications.
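
The two training rounds described in the abstract can be illustrated with a minimal PyTorch sketch. Everything concrete here is an assumption made for illustration: the tiny convolutional `encoder` and `decoder`, the random tensors standing in for images, human saliency maps, and labels, the mean-squared-error reconstruction loss, and the optimizer settings. The paper uses standard CNN backbones and real human saliency data, and its exact losses and hyperparameters may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy encoder/decoder; stand-ins for a real CNN backbone.
encoder = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(16, 8, kernel_size=2, stride=2), nn.ReLU(),
    nn.ConvTranspose2d(8, 1, kernel_size=2, stride=2), nn.Sigmoid(),
)

# Random placeholders for images, human saliency maps in [0, 1], and labels.
images = torch.rand(16, 1, 32, 32)
saliency = torch.rand(16, 1, 32, 32)
labels = torch.randint(0, 2, (16,))

# Round 1: train encoder + decoder to predict saliency maps (no class labels).
autoencoder = nn.Sequential(encoder, decoder)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
saliency_loss = nn.MSELoss()  # assumed loss; not taken from the paper
for _ in range(5):
    optimizer.zero_grad()
    loss = saliency_loss(autoencoder(images), saliency)
    loss.backward()
    optimizer.step()

# Round 2: drop the decoder, put a classification head on the same
# (now pre-trained) encoder, and train with cross-entropy on class labels.
classifier = nn.Sequential(
    encoder, nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2)
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
cross_entropy = nn.CrossEntropyLoss()
for _ in range(5):
    optimizer.zero_grad()
    loss = cross_entropy(classifier(images), labels)
    loss.backward()
    optimizer.step()

logits = classifier(images)  # shape: (batch, num_classes)
```

The point of the sketch is that the `encoder` object is shared between the two rounds, so the classifier starts from weights shaped by the saliency task rather than from ImageNet or random initialization.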

Paper Structure (27 sections, 4 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: MENTOR approach. Step 1: An autoencoder-based model is first trained to recreate human saliency maps, thereby building an understanding of human perception-sourced salient features into the encoder $\mathcal{F}_\text{encoder}$. Step 2: The pre-trained encoder $\mathcal{F}_\text{encoder}$ is then decoupled from the autoencoder and, together with a classifier $\mathcal{F}_\text{class}$, tuned for the anomaly detection task using standard cross-entropy loss. MENTOR has been designed and evaluated to address the generalization required within a single domain (in this paper, independently for iris presentation attack, synthetic face, and chest X-ray-based anomaly detection).
  • Figure 2: A byproduct of the MENTOR pre-training approach is the autoencoder $\mathcal{F}_\text{decoder}(\mathcal{F}_\text{encoder}(\cdot))$ predicting saliency maps that resemble human salience (top row: iris presentation attack detection, middle row: synthetic face detection, and bottom row: chest X-ray-based diagnosis).
  • Figure 3: Same as Fig. 2, except that saliency maps for UNET+Gaze [karargyris2021creation], CYBORG [boyd2021cyborg], and cross-entropy (Xent) are shown. MENTOR and UNET+Gaze saliency maps are generated from their respective decoders, whereas CYBORG and Xent maps are generated using Class Activation Mapping (CAM). Samples were generated from the validation set.
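
For readers unfamiliar with the Class Activation Mapping mentioned in the Figure 3 caption, the sketch below shows the basic CAM computation: the final convolutional feature maps are summed channel-wise, weighted by the classification layer's weights for the chosen class. The function name `class_activation_map`, the tensor shapes, and the min-max normalization are illustrative assumptions; the caption does not specify which CAM variant or post-processing the authors used.

```python
import torch


def class_activation_map(features, fc_weight, class_idx):
    """Basic CAM for one image.

    features:  (C, H, W) feature maps from the last convolutional layer
    fc_weight: (num_classes, C) weights of the linear layer that follows
               global average pooling
    class_idx: index of the class to visualize
    """
    # Weighted sum over channels using the weights of the selected class.
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], features)
    # Min-max normalize to [0, 1] for display (an assumed convention).
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)


# Usage with random placeholder tensors.
torch.manual_seed(0)
features = torch.rand(16, 8, 8)
fc_weight = torch.randn(2, 16)
cam = class_activation_map(features, fc_weight, class_idx=1)
```

In practice the resulting map is upsampled to the input resolution before being overlaid on the image.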