Interpretability of Language Models via Task Spaces

Lucas Weber, Jaap Jumelet, Elia Bruni, Dieuwke Hupkes

TL;DR

This work approaches the interpretability of language models by introducing linguistic task spaces, which map how models generalize across interconnected linguistic phenomena. It combines similarity probing with fine-tuning via gradient differentials (FTGD) to disentangle the learning signals of linguistic tasks and quantify task relationships, applying both methods to decoder-based transformers of ~27M, ~70M, and ~203M parameters at different pre-training stages. Key contributions include FTGD for targeted, low-impact task fine-tuning, similarity probing for building linguistic task spaces, and empirical findings that larger models generalize better to overarching linguistic concepts and that linguistic processing becomes more distributed over pre-training as related tasks share more parameters; task spaces remain largely stable during training. The framework enables linguistic hypothesis testing and offers a principled path toward interpretable LM behavior, with potential extensions to other domains and larger LLMs.

Abstract

The usual way to interpret language models (LMs) is to test their performance on different benchmarks and subsequently infer their internal processes. In this paper, we present an alternative approach, concentrating on the quality of LM processing, with a focus on their language abilities. To this end, we construct 'linguistic task spaces' -- representations of an LM's language conceptualisation -- that shed light on the connections LMs draw between language phenomena. Task spaces are based on the interactions of the learning signals from different linguistic phenomena, which we assess via a method we call 'similarity probing'. To disentangle the learning signals of linguistic phenomena, we further introduce a method called 'fine-tuning via gradient differentials' (FTGD). We apply our methods to language models of three different scales and find that larger models generalise better to overarching general concepts for linguistic tasks, making better use of their shared structure. Further, the distributedness of linguistic processing increases with pre-training through increased parameter sharing between related linguistic tasks. The overall generalisation patterns are mostly stable throughout training and not marked by incisive stages, potentially explaining the lack of successful curriculum strategies for LMs.
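
The transfer computation behind similarity probing can be sketched as follows, using the steps given in the Figure 1 caption: evaluate the untuned LM, tune one LM per task, re-evaluate, and subtract. This is a minimal illustration only. The task names, accuracy values, and hypothesis matrix are invented, and Pearson correlation is an assumed way of comparing the transfer space to a hypothesis space, not necessarily the comparison used in the paper.

```python
from math import sqrt


def transfer_space(eval1, eval2):
    """Build the transfer matrix T[i][j] = eval2[i][j] - eval1[j].

    eval1[j]    : score of the untuned LM on task j
    eval2[i][j] : score on task j of the LM tuned on task i
    """
    return [[after - before for after, before in zip(row, eval1)]
            for row in eval2]


def flatten(matrix):
    """Flatten a nested list row by row."""
    return [value for row in matrix for value in row]


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (assumed metric)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for three made-up tasks.
tasks = ["task_a", "task_b", "task_c"]
eval1 = [0.60, 0.70, 0.50]          # untuned LM
eval2 = [
    [0.90, 0.80, 0.50],             # LM tuned on task_a
    [0.70, 0.95, 0.50],             # LM tuned on task_b
    [0.60, 0.70, 0.85],             # LM tuned on task_c
]

transfers = transfer_space(eval1, eval2)

# Hypothetical hypothesis space: task_a and task_b are expected to transfer
# to each other, task_c is expected to be independent.
hypothesis = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]

score = pearson(flatten(transfers), flatten(hypothesis))
print(f"agreement with hypothesis space: {score:.2f}")
```

Each row of `transfers` shows how tuning on one task changes performance on every task. Positive off-diagonal entries indicate that two tasks share a learning signal, and negative entries indicate interference.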

Paper Structure (41 sections, 2 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: The process of similarity probing to obtain a task space based on transfers: 1. Evaluate the untuned LM on all tasks (eval1); 2. Tune one LM for each task; 3. Re-evaluate the LMs on all tasks (eval2). Calculate all transfers (eval2 - eval1) and compare the resulting transfer task space to a hypothesized set of transfers (Hypothesis space).
  • Figure 2: Parameter bins ranging from small to large gradients. Only a small number of parameters carry the largest portion of gradient mass. Our cut-off ($\epsilon$) maintains a large portion of gradient mass while significantly reducing the number of trained parameters.
  • Figure 3: (a) BLiMP accuracy per phenomenon before and after fine-tuning using full gradients or our gradient difference method. FTGD is either as effective or more effective in improving benchmark performance on all phenomena. (b) The relative increase in perplexity (ppl) on the wiki103 validation set during the fine-tuning process of models trained for 20 epochs. FTGD barely affects perplexity, while full gradients are highly disruptive.
  • Figure 4: Different similarity patterns within phenomena for LM203 after 20 epochs of pre-training. We find high similarity for all different paradigms in determiner noun agreement (a); high similarity but interfering subclusters for filler-gap dependencies (b); and no similarity for different binding paradigms (c). The exact identities of the individual rows and columns can be found in Table \ref{tab:phenomena_paradigms} in Appendix \ref{ch2:app:similarity_spaces}.
  • Figure 5: The degree of within-phenomena transfer for different models pre-trained for 20 epochs. A high value indicates that the model strongly generalises the phenomenon. A mapping of abbreviations to full names of phenomena can be found in Appendix \ref{app:mapping_abbr_full_name}.
  • ...and 5 more figures
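
The idea in the Figure 2 caption, that a small number of parameters carry most of the gradient mass, can be illustrated with a short sketch. The function below keeps the largest-magnitude gradient entries until a target share of the total absolute gradient is covered. The name `top_mass_mask`, the `mass_fraction` parameter, and the toy gradient values are assumptions made for illustration; the paper's cut-off $\epsilon$ may be defined differently, and this is not the FTGD procedure itself.

```python
def top_mass_mask(grads, mass_fraction):
    """Select the largest-|gradient| entries covering a share of gradient mass.

    grads         : flat list of gradient values, one per parameter
    mass_fraction : share of total |gradient| to retain, between 0 and 1
                    (hypothetical stand-in for the paper's cut-off)
    Returns a list of booleans, True for parameters that would be trained.
    """
    magnitudes = [abs(g) for g in grads]
    total = sum(magnitudes)
    mask = [False] * len(grads)
    if total == 0:
        return mask  # no gradient signal, nothing to select

    # Visit parameters from largest to smallest gradient magnitude.
    order = sorted(range(len(grads)), key=lambda i: magnitudes[i], reverse=True)
    kept = 0.0
    for i in order:
        if kept >= mass_fraction * total:
            break
        mask[i] = True
        kept += magnitudes[i]
    return mask


# Toy gradients: two entries dominate the total mass.
grads = [0.5, -0.02, 0.01, -0.4, 0.03, 0.04]
mask = top_mass_mask(grads, 0.85)

selected = sum(mask)
print(f"training {selected} of {len(grads)} parameters")
```

In this toy case two of the six parameters already cover the requested share of gradient mass, so the remaining four would stay frozen during fine-tuning.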