Limited but consistent gains in adversarial robustness by co-training object recognition models with human EEG
Manshan Guo, Bhavin Choksi, Sari Sadiya, Alessandro T. Gifford, Martina G. Vilas, Radoslaw M. Cichy, Gemma Roig
TL;DR
This work tackles the vulnerability of object-recognition networks to adversarial perturbations by testing whether aligning model representations with human EEG responses improves robustness. It introduces a dual-task learning framework in which a ResNet50 backbone jointly predicts EEG signals and object labels, trained on the large-scale THINGS EEG dataset (17-channel recordings sampled at 100 Hz). The study finds a positive correlation between EEG-prediction accuracy and robustness gains across multiple architectures and attacks, with EEG-prediction accuracy often peaking around 100 ms after stimulus onset and mid-level parieto-occipital channels driving much of the effect; however, the gains are modest and persist even in shuffled-EEG controls. These results suggest that scalable, brain-informed regularization with EEG data can aid adversarial robustness, motivating larger, more diverse EEG datasets and multimodal stimulus conditions to amplify the effect.
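The dual-task idea can be illustrated with a single joint objective. The text does not specify the loss functions or how the two tasks are weighted, so the sketch below assumes, hypothetically, a softmax cross-entropy on the object label plus a mean-squared-error term on the predicted EEG (a channels x time array), combined through a hypothetical weight `eeg_weight`. It is plain Python on one toy example rather than a ResNet50, and only shows how both targets enter one loss.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one example, using the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mse(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean squared error over a channels x time EEG array (assumed loss, not from the source)."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def dual_task_loss(logits, label, eeg_pred, eeg_true, eeg_weight=1.0):
    """Hypothetical joint objective: classification loss plus a weighted EEG-prediction loss."""
    return cross_entropy(logits, label) + eeg_weight * mse(eeg_pred, eeg_true)


# Toy example (made-up values): 3 classes, 2 EEG channels x 4 time points.
logits = [2.0, 0.5, -1.0]
eeg_true = [[0.1, 0.4, 0.2, 0.0], [0.0, -0.3, 0.1, 0.2]]
eeg_pred = [[0.0, 0.5, 0.2, 0.1], [0.1, -0.2, 0.0, 0.2]]
print(round(dual_task_loss(logits, 0, eeg_pred, eeg_true, eeg_weight=0.5), 4))
```

In a real network, both terms would be computed from heads sharing the same backbone features, so gradients from the EEG term shape the representation used for classification; the weighting is a design choice the sketch leaves as a free parameter.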
Abstract
In contrast to human vision, artificial neural networks (ANNs) remain relatively susceptible to adversarial attacks. To address this vulnerability, efforts have been made to transfer inductive biases from human brains to ANNs, often by training the ANN representations to match their biological counterparts. Previous works relied on brain data acquired in rodents or primates using invasive techniques, from specific regions of the brain, under non-natural conditions (anesthetized animals), and with stimulus datasets lacking diversity and naturalness. In this work, we explored whether aligning model representations to human EEG responses to a rich set of real-world images increases the adversarial robustness of ANNs. Specifically, we trained ResNet50-backbone models on a dual task of classification and EEG prediction, and evaluated their EEG prediction accuracy and robustness to adversarial attacks. We observed a significant correlation between the networks' EEG prediction accuracy, often highest around 100 ms after stimulus onset, and their gains in adversarial robustness. Although the effect size was limited, the effects were consistent across random initializations and robust across architectural variants. We further analyzed individual EEG channels separately and observed the strongest contribution from electrodes in the parieto-occipital regions. The demonstrated utility of human EEG for such tasks opens avenues for future work that scales to larger datasets and more diverse stimulus conditions, with the promise of stronger effects.
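The evaluation logic can be sketched under stated assumptions. Here EEG prediction accuracy is taken, hypothetically, as the Pearson correlation across test images between predicted and recorded responses at each channel and time point, averaged over channels to give a time course; each model's mean accuracy is then correlated with its robustness gain. The metric, the helper names (`pearson`, `timecourse_accuracy`), and all numbers are illustrative toy values, not data or results from the study.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / (sx * sy)


def timecourse_accuracy(pred, true) -> List[float]:
    """pred and true are indexed [image][channel][time].

    For each (channel, time) pair, correlate predicted and recorded EEG across
    images, then average over channels to get one accuracy value per time point.
    """
    n_images, n_channels, n_times = len(true), len(true[0]), len(true[0][0])
    curve = []
    for t in range(n_times):
        rs = []
        for c in range(n_channels):
            p = [pred[i][c][t] for i in range(n_images)]
            y = [true[i][c][t] for i in range(n_images)]
            rs.append(pearson(p, y))
        curve.append(sum(rs) / n_channels)
    return curve


# Toy data (hypothetical): 3 images x 1 channel x 2 time points.
true = [[[1.0, 0.0]], [[2.0, 1.0]], [[3.0, 0.0]]]
pred = [[[1.1, 1.0]], [[1.9, 0.0]], [[3.2, 1.0]]]
curve = timecourse_accuracy(pred, true)
peak = max(range(len(curve)), key=lambda t: curve[t])
print(curve, peak)

# Across models: mean EEG accuracy vs. robustness gain (toy values).
model_accuracy = [0.10, 0.20, 0.30]
robustness_gain = [1.0, 2.0, 3.0]
print(pearson(model_accuracy, robustness_gain))
```

At 100 Hz, consecutive time indices are 10 ms apart, so a peak index converts to a latency once the epoch's start time relative to stimulus onset is known. Averaging over time instead of channels would give a per-channel profile of the kind used to compare electrode regions.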
