What Context Features Can Transformer Language Models Use?
Joe O'Connor, Jacob Andreas
TL;DR
This study asks which parts of a long context transformer language models actually use. It applies the V-information framework to measure usable information and runs targeted ablations on WikiText-103 with a GPT-2–style model. Aggressive manipulations, such as shuffling word order within sentences or deleting all words other than nouns, remove less than 15% of the usable information, which suggests that long-range predictive power rests largely on content words and local co-occurrence statistics rather than on detailed syntactic or propositional content. The results suggest that more efficient context representations, rather than simply longer contexts, could improve language modeling, and they highlight differences between training and evaluation paradigms and the role of overfitting. These findings bear on the design of scalable, information-preserving context mechanisms for future transformer-based LMs.
Abstract
Transformer-based language models benefit from conditioning on contexts of hundreds to thousands of previous tokens. What aspects of these contexts contribute to accurate model prediction? We describe a series of experiments that measure usable information by selectively ablating lexical and structural information in transformer language models trained on English Wikipedia. In both mid- and long-range contexts, we find that several extremely destructive context manipulations -- including shuffling word order within sentences and deleting all words other than nouns -- remove less than 15% of the usable information. Our results suggest that long contexts, but not their detailed syntactic and propositional content, are important for the low perplexity of current transformer language models.
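The two ablations named in the abstract, and the notion of a percentage of usable information removed, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the `(token, tag)` input format, the choice of `NOUN` and `PROPN` as noun tags, and the normalization in `fraction_removed` (extra loss under the ablation divided by the loss reduction that the full context gives over no context) are all assumptions made for illustration.

```python
import random


def shuffle_within_sentences(sentences, seed=0):
    """Shuffle token order inside each sentence; sentence order is unchanged."""
    rng = random.Random(seed)  # local RNG so the result is reproducible
    shuffled = []
    for sentence in sentences:
        tokens = list(sentence)  # copy so the caller's data is not mutated
        rng.shuffle(tokens)
        shuffled.append(tokens)
    return shuffled


def keep_only_nouns(tagged_sentences, noun_tags=("NOUN", "PROPN")):
    """Delete every token whose part-of-speech tag is not in noun_tags.

    Assumption: each sentence is a list of (token, tag) pairs produced by
    some external tagger; the tag names are hypothetical defaults.
    """
    return [
        [token for token, tag in sentence if tag in noun_tags]
        for sentence in tagged_sentences
    ]


def fraction_removed(loss_full, loss_ablated, loss_no_context):
    """Share of the full context's loss reduction lost under an ablation.

    Assumed normalization: 0.0 means the ablated context is as useful as
    the full context, 1.0 means it is no better than having no context.
    """
    usable = loss_no_context - loss_full
    if usable <= 0:
        raise ValueError("full context must lower the loss relative to no context")
    return (loss_ablated - loss_full) / usable
```

Under this assumed normalization, the abstract's "less than 15%" would correspond to `fraction_removed(...) < 0.15`, with each loss measured on held-out text by a model given the corresponding kind of context.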
