Designing and Interpreting Probes with Control Tasks

John Hewitt, Percy Liang

TL;DR

This paper introduces control tasks to diagnose whether supervised probes truly extract linguistic structure from representations or simply memorize task mappings. By defining selectivity as the gap between linguistic-task accuracy and control-task accuracy, the authors show that many popular probes (notably MLPs) memorize rather than reveal linguistic content, while linear and bilinear probes can achieve high selectivity with minimal loss in linguistic performance. The study also demonstrates that regularization methods like dropout are not reliably helpful for selectivity, whereas constrained hidden sizes and careful training data choices can improve it; layer comparisons must incorporate selectivity to avoid misleading conclusions. Finally, the work reveals that ELMo’s second layer can be more selective for POS than the first, challenging assumptions about which layer encodes linguistic properties, and emphasizes the value of selectivity in interpreting representations. These insights provide a more nuanced framework for probing contextual representations and comparing layers or tasks.

Abstract

Probes, supervised models trained to predict properties (like parts-of-speech) from representations (like ELMo), have achieved high accuracy on a range of linguistic tasks. But does this mean that the representations encode linguistic structure or just that the probe has learned the linguistic task? In this paper, we propose control tasks, which associate word types with random outputs, to complement linguistic tasks. By construction, these tasks can only be learned by the probe itself. So a good probe (one that reflects the representation) should be selective, achieving high linguistic task accuracy and low control task accuracy. The selectivity of a probe puts linguistic task accuracy in context with the probe's capacity to memorize from word types. We construct control tasks for English part-of-speech tagging and dependency edge prediction, and show that popular probes on ELMo representations are not selective. We also find that dropout, commonly used to control probe complexity, is ineffective for improving selectivity of MLPs, but that other forms of regularization are effective. Finally, we find that while probes on the first layer of ELMo yield slightly better part-of-speech tagging accuracy than the second, probes on the second layer are substantially more selective, which raises the question of which layer better represents parts-of-speech.
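
As a rough illustration of how a control task "associates word types with random outputs", the sketch below assigns every word type one fixed, randomly drawn label and then labels each token with its type's output, ignoring context. The function names, the toy sentences, the uniform sampling, and the label count of 4 are all assumptions made for this example; they are not taken from the paper's released code.

```python
import random


def make_control_task(vocab, num_labels, seed=0):
    """Map each word type to one fixed random label in range(num_labels).

    Uniform sampling is an assumption of this sketch; the abstract only
    says the outputs are random.
    """
    rng = random.Random(seed)
    # Sort the vocabulary so the mapping is reproducible for a given seed.
    return {word: rng.randrange(num_labels) for word in sorted(vocab)}


def label_tokens(sentence, type_to_label):
    """Give every token its word type's label, regardless of context."""
    return [type_to_label[word] for word in sentence]


# Hypothetical toy corpus, for illustration only.
sentences = [["the", "cat", "ran"], ["the", "dog", "ran", "home"]]
vocab = {word for sentence in sentences for word in sentence}

control = make_control_task(vocab, num_labels=4)
control_labels = [label_tokens(s, control) for s in sentences]
```

Because the labels depend only on word identity, a probe can score well on this task only by memorizing the type-to-label mapping, which is the property the control task is meant to isolate.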

Paper Structure

This paper contains 27 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Our control tasks define random behavior (like a random output, top) for each word type in the vocabulary. Each word token is assigned its type's output, regardless of context (middle, bottom). Control tasks have the same input and output space as a linguistic task (e.g., parts-of-speech) but can only be learned if the probe memorizes the mapping.
  • Figure 2: Selectivity is defined as the difference between linguistic task accuracy and control task accuracy, and can vary widely, as shown, across probes which achieve similar linguistic task accuracies. These results are taken from the section on selectivity results.
  • Figure 3: Example dependency tree from the development set of the Penn Treebank with dependents pointing at heads, and the structure resulting from our dependency edge prediction control task on the same sentence.
  • Figure 4: Linguistic task accuracies and selectivities for the 5 complexity control methods. All methods except dropout and early stopping are shown to improve selectivity without a large impact on linguistic task accuracy. All methods for the same task share a common y-axis, and use their own categorical x-axis. All x-axes are ordered from most severe constraints on complexity (left) to most laissez-faire (right).
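
The selectivity measure described in the Figure 2 caption can be sketched in a few lines. The helper names and the accuracy figures for the two probes below are invented for illustration and are not results from the paper; they only show how two probes with similar linguistic task accuracy can differ widely in selectivity.

```python
def accuracy(predictions, gold):
    """Fraction of positions where the prediction matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def selectivity(linguistic_acc, control_acc):
    """Selectivity = linguistic task accuracy minus control task accuracy."""
    return linguistic_acc - control_acc


# Hypothetical accuracies for two probes (NOT figures from the paper).
probes = {
    "probe_a": {"linguistic": 0.97, "control": 0.92},
    "probe_b": {"linguistic": 0.96, "control": 0.70},
}

scores = {
    name: selectivity(acc["linguistic"], acc["control"])
    for name, acc in probes.items()
}

# Rank probes from most to least selective.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this made-up comparison the probe with slightly lower linguistic accuracy is far more selective, which is the kind of contrast the selectivity measure is meant to expose.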