Linguistic Interpretability of Transformer-based Language Models: a systematic review
Miguel López-Otal, Jorge Gracia, Jordi Bernad, Carlos Bobed, Lucía Pitarch-Ballesteros, Emma Anglés-Herrero
TL;DR
This systematic review addresses how Transformer-based pre-trained language models internalize linguistic knowledge. By synthesizing 160 within-representation studies across multiple languages, it shows that syntax and lexico-semantics are generally well captured, with morphology and discourse receiving more variable attention. The authors categorize discovery methods (notably probing) and map where in the network various linguistic phenomena tend to reside, highlighting both common patterns (e.g., middle-layer emphasis for English syntax; subspace geometries) and cross-language variability. The work underscores the practical implications for cross-lingual NLP, model transparency, and future interpretability research, while calling for causal probing and broader architectural coverage.
Abstract
Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of 'black box' system. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to that of human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents some limitations -- e.g. only studying English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.
