Slowing Learning by Erasing Simple Features
Lucia Quirke, Nora Belrose
TL;DR
The paper investigates the learning biases of neural networks by erasing simple statistical cues from image classification data. It uses LEACE to erase first-order (linear) information and studies quadratic erasure methods grounded in optimal transport theory, namely the closed-form QLEACE and the approximate ALF-QLEACE, which remove class information carried by low-order moments of the data. On CIFAR-10, CIFARNet, and SVHN, across feedforward and more modern architectures, LEACE consistently slows learning, while quadratic erasure yields mixed, architecture-dependent outcomes and backfires in some cases; the gradient-based and ALF variants can act like data augmentation but are unreliable. The work highlights both the potential and the pitfalls of concept erasure for interpretability and robustness, and urges caution when applying quadratic erasure in practice.
Abstract
Prior work suggests that neural networks tend to learn low-order moments of the data distribution first, before moving on to higher-order correlations. In this work, we derive a novel closed-form concept erasure method, QLEACE, which surgically removes all quadratically available information about a concept from a representation. Through comparisons with linear erasure (LEACE) and two approximate forms of quadratic erasure, we explore whether networks can still learn when low-order statistics are removed from image classification datasets. We find that while LEACE consistently slows learning, quadratic erasure can either speed up or slow down learning depending on the choice of dataset, model architecture, and erasure method. QLEACE consistently slows learning in feedforward architectures, but more sophisticated architectures learn to use injected higher-order Shannon information about class labels. The approximate variants of QLEACE avoid injecting information, but surprisingly act as data augmentation techniques on some datasets, enhancing learning speed compared to LEACE.
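To make "linear erasure" concrete, the following is a minimal NumPy sketch of a LEACE-style eraser. It assumes the closed form r(x) = x - W^+ P W (x - E[x]), where W is the whitening matrix of the features and P is the orthogonal projection onto the column space of W Sigma_XZ. The function name `fit_linear_eraser`, the tolerances, and the synthetic data are illustrative assumptions, not the authors' implementation. After erasure, every class has the same mean, so the first-order signal about the label is gone.

```python
import numpy as np


def fit_linear_eraser(X, Z, tol=1e-10):
    """Fit a LEACE-style linear eraser (illustrative sketch).

    X: (n, d) feature matrix; Z: (n, k) one-hot label matrix.
    Returns a function mapping features to erased features.
    """
    n = X.shape[0]
    mu_x = X.mean(axis=0)
    Xc = X - mu_x
    Zc = Z - Z.mean(axis=0)

    sigma_xx = Xc.T @ Xc / n  # feature covariance, (d, d)
    sigma_xz = Xc.T @ Zc / n  # feature/label cross-covariance, (d, k)

    # Whitening matrix W = Sigma_xx^{-1/2} and its pseudo-inverse,
    # built from the eigendecomposition and skipping null directions.
    eigvals, eigvecs = np.linalg.eigh(sigma_xx)
    keep = eigvals > tol * eigvals.max()
    V, lam = eigvecs[:, keep], eigvals[keep]
    W = (V / np.sqrt(lam)) @ V.T
    W_pinv = (V * np.sqrt(lam)) @ V.T

    # Orthogonal projection onto the column space of W @ Sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * s.max()]
    P = U @ U.T

    M = W_pinv @ P @ W  # the matrix applied to centered inputs

    def erase(X_new):
        # Row-vector form of r(x) = x - M (x - mu_x).
        return X_new - (X_new - mu_x) @ M.T

    return erase


# Synthetic example (assumed data): three classes with shifted means.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=600)
X = rng.normal(size=(600, 5)) + labels[:, None]
Z = np.eye(3)[labels]

erase = fit_linear_eraser(X, Z)
X_erased = erase(X)

# Largest gap between any class mean and the global mean, before and after.
gap_before = max(np.abs(X[labels == c].mean(0) - X.mean(0)).max() for c in range(3))
gap_after = max(np.abs(X_erased[labels == c].mean(0) - X_erased.mean(0)).max() for c in range(3))
print(gap_before, gap_after)
```

On the data used to fit it, the eraser drives the cross-covariance between features and labels to zero up to floating-point error, which is the condition under which no linear classifier can beat a constant predictor. Quadratic erasure goes further by also equalizing the class-conditional covariances; this sketch does not attempt that.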
