Self-Attention-Based Contextual Modulation Improves Neural System Identification
Isaac Lin, Tianye Wang, Shang Gao, Shiming Tang, Tai Sing Lee
TL;DR
This paper investigates how self-attention (SA) can augment convolutional neural networks (CNNs) that predict macaque V1 responses to natural images. It demonstrates that a simple SA layer, paired with a CNN and trained with an incremental learning protocol, improves both the Pearson correlation of predicted tuning curves and the ability to predict peak tuning, compared to a parameter-matched baseline. By factorizing contextual modulation into convolutions, SA, and a readout, the authors show that local receptive-field information dominates overall tuning, while surround information is essential for accurately predicting the strongest responses; incremental learning helps separate these contributions. SA and a fully connected readout yield complementary benefits, and incremental training highlights center-surround interactions that resemble early visual processing. These findings advance the understanding of surround modulation in cortical computation and point to data-efficient strategies for neural prediction models.
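The two evaluation metrics named above can be illustrated with a minimal sketch. The Pearson correlation is standard. The summary does not define how peak tuning is computed, so `top_k_overlap` below is a hypothetical proxy (overlap between the model's and the neuron's top-k stimuli), not the authors' metric. The function names and the synthetic data are likewise assumptions made for illustration.

```python
import numpy as np

def tuning_correlation(predicted, measured):
    """Pearson correlation between predicted and measured responses of one neuron."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.corrcoef(predicted, measured)[0, 1])

def top_k_overlap(predicted, measured, k=10):
    """Hypothetical peak-tuning proxy (NOT the paper's definition):
    fraction of the k stimuli with the largest measured responses
    that the model also places among its own top k."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    top_pred = set(np.argsort(predicted)[-k:].tolist())
    top_meas = set(np.argsort(measured)[-k:].tolist())
    return len(top_pred & top_meas) / k

# Synthetic example: 200 stimuli, noisy predictions of a skewed response profile.
rng = np.random.default_rng(0)
measured = rng.gamma(shape=2.0, scale=1.0, size=200)
predicted = measured + rng.normal(scale=0.5, size=200)
print("tuning correlation:", tuning_correlation(predicted, measured))
print("top-10 overlap:", top_k_overlap(predicted, measured, k=10))
```

A model can score well on the first measure while doing poorly on the second, because a correlation computed over all stimuli is dominated by the many weak and moderate responses. This is one reason to report the two separately.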
Abstract
Convolutional neural networks (CNNs) have been shown to be state-of-the-art models for visual cortical neurons. Neurons in the primary visual cortex are sensitive to contextual information mediated by extensive horizontal and feedback connections. Standard CNNs integrate global contextual information to model contextual modulation via two mechanisms: successive convolutions and a fully connected readout layer. In this paper, we find that self-attention (SA), an implementation of non-local network mechanisms, can improve neural response predictions over parameter-matched CNNs in two key metrics: tuning curve correlation and peak tuning. We introduce peak tuning as a metric to evaluate a model's ability to capture a neuron's top feature preference. We factorize networks to assess each context mechanism, revealing that information in the local receptive field is most important for modeling overall tuning, but surround information is critical for characterizing the tuning peak. We find that self-attention can replace posterior spatial-integration convolutions when learned incrementally, and is further enhanced in the presence of a fully connected readout layer, suggesting that the two context mechanisms are complementary. Finally, we find that decomposing receptive field learning and contextual modulation learning in an incremental manner may be an effective and robust mechanism for learning surround-center interactions.
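To make the non-local mechanism concrete, the following is a minimal sketch of single-head scaled dot-product self-attention applied across the spatial positions of a CNN feature map. It is not the authors' architecture: the residual connection, the projection sizes, the random weights, and all names (`spatial_self_attention`, `w_q`, `w_k`, `w_v`) are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(features, w_q, w_k, w_v):
    """Single-head self-attention over the spatial positions of a feature map.

    features: array of shape (C, H, W), e.g. the output of a conv layer.
    w_q, w_k: projection matrices of shape (C, d).
    w_v:      projection matrix of shape (C, C).
    Returns the modulated feature map (C, H, W) and the attention
    weights (H*W, H*W).
    """
    c, h, w = features.shape
    tokens = features.reshape(c, h * w).T          # (N, C), one token per location
    q = tokens @ w_q                               # (N, d)
    k = tokens @ w_k                               # (N, d)
    v = tokens @ w_v                               # (N, C)
    scores = q @ k.T / np.sqrt(w_q.shape[1])       # (N, N) pairwise similarities
    weights = softmax(scores, axis=-1)             # each row sums to 1
    out = tokens + weights @ v                     # residual connection (assumed)
    return out.T.reshape(c, h, w), weights

# Example with random weights: 8 channels on a 5x5 spatial grid.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 5, 5))
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(8, 4))
w_v = rng.normal(size=(8, 8))
out, attn = spatial_self_attention(feats, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Each row of `attn` describes how strongly one spatial location draws on every other location. This means a unit at the center of the map can be modulated by distant surround positions in a single step, whereas stacked convolutions widen the receptive field only gradually.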
