Identifying and Controlling Important Neurons in Neural Machine Translation
Anthony Bau, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, James Glass
TL;DR
This work shows that individual neurons in neural machine translation (NMT) models carry linguistically meaningful information, so this information is not entirely distributed across the representation. It introduces unsupervised cross-model correlation methods (MaxCorr, MinCorr, LinReg, SVCCA) that identify important neurons without annotations, and it validates their importance via erasure and lightweight verification. The authors present linguistically interpretable neurons (e.g., for parentheses and tense) and show that activating or suppressing specific neurons can steer translations in a controlled way, with varying success across tense, number, and gender. The results have implications for bias mitigation, interpretability, and targeted control in MT, and the framework can be adapted to other neural tasks and architectures.
Abstract
Neural machine translation (NMT) models learn representations containing substantial linguistic information. However, it is not clear if such information is fully distributed or if some of it can be attributed to individual neurons. We develop unsupervised methods for discovering important neurons in NMT models. Our methods rely on the intuition that different models learn similar properties, and do not require any costly external supervision. We show experimentally that translation quality depends on the discovered neurons, and find that many of them capture common linguistic phenomena. Finally, we show how to control NMT translations in predictable ways, by modifying activations of individual neurons.
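The intuition that "different models learn similar properties" can be illustrated with a small cross-model correlation ranking in the spirit of MaxCorr. The following is a minimal sketch, not the authors' implementation: it assumes each model's activations are available as a list of neurons, each neuron being a list of activation values over the same tokens, and it scores a neuron by its highest absolute Pearson correlation with any neuron in any other model. The function names (`pearson`, `max_corr_scores`, `rank_neurons`) and the toy data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length activation sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant neuron gives no correlation signal
    return cov / math.sqrt(var_x * var_y)


def max_corr_scores(target, others):
    """Score each neuron in `target` by its best |correlation| with any
    neuron in any of the `others` models (assumed layout: model -> neurons
    -> activations over the same tokens)."""
    return [
        max(abs(pearson(neuron, other)) for model in others for other in model)
        for neuron in target
    ]


def rank_neurons(scores):
    """Return neuron indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data (assumption): two "models" observed on the same 200 tokens.
rng = random.Random(0)
signal = [rng.gauss(0, 1) for _ in range(200)]   # a property both models learn
noise = lambda: [rng.gauss(0, 1) for _ in range(200)]

model_a = [
    [s + 0.1 * rng.gauss(0, 1) for s in signal],  # neuron 0: tracks the shared property
    noise(),                                      # neuron 1: model-specific noise
]
model_b = [
    noise(),                                           # unrelated neuron
    [-2.0 * s + 0.1 * rng.gauss(0, 1) for s in signal],  # same property, flipped and rescaled
]

scores = max_corr_scores(model_a, [model_b])
ranking = rank_neurons(scores)
```

In this toy setup, neuron 0 of `model_a` ranks first because another model contains a strongly (negatively) correlated neuron; taking the absolute value makes the score insensitive to sign and scale. Real use would replace the toy lists with activations extracted from independently trained NMT models.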
