Measures of Information Reflect Memorization Patterns
Rachit Bansal, Danish Pruthi, Yonatan Belinkov
TL;DR
The paper tackles generalization failures in neural networks caused by heuristic and example-level memorization. It introduces entropy $H$ and mutual information $I$ over neuron activations to quantify intra- and inter-neuron diversity, respectively, as intrinsic indicators of memorization. Across semi-synthetic and natural NLP/vision tasks, networks displaying heuristic memorization show low $H$ and high $I$, while example-level memorization yields high $H$ and low $I$, and these consistent patterns aid model selection. This intrinsic, data-efficient framework provides a scalable tool for diagnosing memorization and guiding model selection without relying on curated OOD test sets.
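For concreteness, here is one standard way to write these two quantities. It is a sketch under stated assumptions, not necessarily the paper's exact estimator. Assume each of the $N$ neurons' activations $A_j$ is discretized into bins $b$ with empirical probabilities $p_j(b)$. Assume also that $H$ averages the per-neuron entropy over neurons and $I$ averages the mutual information over neuron pairs; this aggregation is an assumption.

$$
H = \frac{1}{N}\sum_{j=1}^{N} H(A_j), \qquad H(A_j) = -\sum_{b} p_j(b)\,\log_2 p_j(b)
$$

$$
I = \frac{2}{N(N-1)}\sum_{j<k} I(A_j;A_k), \qquad I(A_j;A_k) = H(A_j) + H(A_k) - H(A_j,A_k)
$$

Low $H$ means individual neurons take few distinct activation levels (low intra-neuron diversity). High $I$ means neurons largely duplicate one another (low inter-neuron diversity).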
Abstract
Neural networks are known to exploit spurious artifacts (or shortcuts) that co-occur with a target label, exhibiting heuristic memorization. On the other hand, networks have been shown to memorize training examples, resulting in example-level memorization. These kinds of memorization impede generalization of networks beyond their training distributions. Detecting such memorization can be challenging, often requiring researchers to curate tailored test sets. In this work, we hypothesize, and subsequently show, that the diversity in the activation patterns of different neurons is reflective of model generalization and memorization. We quantify the diversity in the neural activations through information-theoretic measures and find support for our hypothesis on experiments spanning several natural language and vision tasks. Importantly, we discover that information organization points to the two forms of memorization, even for neural activations computed on unlabelled in-distribution examples. Lastly, we demonstrate the utility of our findings for the problem of model selection. The associated code and other resources for this work are available at https://rachitbansal.github.io/information-measures.
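The measures need only an activation matrix collected on unlabelled in-distribution inputs. The sketch below shows one way to compute them with NumPy. It is a minimal illustration under assumptions: equal-width binning per neuron, 10 bins, plug-in (histogram) estimates, and averaging over neurons and neuron pairs. The function names (`discretize`, `information_measures`, and so on) are hypothetical and are not taken from the authors' released code.

```python
import numpy as np


def discretize(acts: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Map each neuron's activations (one column per neuron) to bin ids 0..n_bins-1.

    Equal-width bins between each neuron's min and max are an assumption here.
    """
    binned = np.zeros(acts.shape, dtype=int)
    for j in range(acts.shape[1]):
        col = acts[:, j]
        if col.max() == col.min():
            continue  # constant neuron: every example stays in bin 0
        edges = np.linspace(col.min(), col.max(), n_bins + 1)
        # Interior edges only, so the minimum maps to 0 and the maximum to n_bins-1.
        binned[:, j] = np.digitize(col, edges[1:-1])
    return binned


def entropy(labels: np.ndarray) -> float:
    """Plug-in Shannon entropy (bits) of a discrete sample."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def mutual_information(a: np.ndarray, b: np.ndarray, n_bins: int) -> float:
    """I(a; b) = H(a) + H(b) - H(a, b), with each (a, b) pair encoded as one integer."""
    joint = a * n_bins + b
    return entropy(a) + entropy(b) - entropy(joint)


def information_measures(acts: np.ndarray, n_bins: int = 10) -> tuple[float, float]:
    """Return (mean per-neuron entropy H, mean pairwise mutual information I)."""
    binned = discretize(acts, n_bins)
    n = binned.shape[1]
    h = float(np.mean([entropy(binned[:, j]) for j in range(n)]))
    mis = [mutual_information(binned[:, j], binned[:, k], n_bins)
           for j in range(n) for k in range(j + 1, n)]
    i = float(np.mean(mis)) if mis else 0.0
    return h, i


# Usage on unlabelled inputs: rows are examples, columns are neurons.
rng = np.random.default_rng(0)
independent = rng.normal(size=(1000, 8))                      # diverse neurons
redundant = np.repeat(rng.normal(size=(1000, 1)), 8, axis=1)  # 8 copies of one neuron
h_ind, i_ind = information_measures(independent)
h_red, i_red = information_measures(redundant)
print(f"independent: H={h_ind:.2f} I={i_ind:.2f}")
print(f"redundant:   H={h_red:.2f} I={i_red:.2f}")
```

When neurons are identical copies, $I$ rises all the way to $H$ itself. This is the kind of inter-neuron redundancy that high $I$ signals. Independent neurons give $I$ close to zero, apart from the small positive bias that plug-in estimators show on finite samples. Binning choices affect both numbers, so comparisons are most meaningful between models evaluated with the same estimator settings.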
