Holmes: A Benchmark to Assess the Linguistic Competence of Language Models

Andreas Waldis, Yotam Perlitz, Leshem Choshen, Yufang Hou, Iryna Gurevych

TL;DR

Holmes is a new benchmark that assesses language models' linguistic competence, that is, their unconscious understanding of linguistic phenomena. A streamlined version, FlashHolmes, reduces the computational load while maintaining high ranking precision.

Abstract

We introduce Holmes, a new benchmark designed to assess language models' (LMs') linguistic competence, that is, their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs' internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. Composing Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version that reduces the computational load while maintaining high ranking precision.

Paper Structure (79 sections, 15 figures, 9 tables)

Figures (15)

  • Figure 1: In Holmes, we encode examples of probing datasets using frozen LMs. Then, we train probes (linear models) with labels representing the specific linguistic phenomenon under test. Finally, we use the results of testing the probes to approximate the LMs' linguistic competence regarding the tested phenomena.
  • Figure 2: Overview of Holmes (left) with the five phenomena types (right) and an example of probing-based evaluations for part-of-speech: encoding the input tokens and predicting the POS tag for cucumber, here NN.
  • Figure 3: Citation analysis considering probing citations originating from the set of relevant work and every other citation (general citations). The color scale indicates the ratio ($\alpha$) between them.
  • Figure 4: Categorization of the selected studies by their focus and their conducted probing method.
  • Figure 5: Overview of how many tasks single LMs cover and vice versa - single examples are highlighted.
  • ...and 10 more figures
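
The probing setup described in the Figure 1 caption (encode examples with a frozen LM, train a linear probe on the phenomenon labels, test the probe) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the Holmes code: `frozen_encode` is a hypothetical stand-in that returns synthetic vectors rather than real LM representations, the three-tag label set is a toy, and the probe is a plain softmax classifier trained with SGD.

```python
import math
import random

TAGS = ["NOUN", "VERB", "ADJ"]  # toy label set, not the Holmes label inventory


def frozen_encode(tag_id, rng, dim=4, noise=0.1):
    """Hypothetical stand-in for a frozen LM (assumption, not a real model).

    Returns a one-hot prototype for the tag plus Gaussian noise, simulating
    representations that linearly encode the phenomenon under test.
    """
    return [(1.0 if i == tag_id else 0.0) + rng.gauss(0.0, noise) for i in range(dim)]


def make_dataset(n_per_tag, rng):
    """Build shuffled (vector, label) pairs from the toy encoder."""
    data = [(frozen_encode(t, rng), t) for t in range(len(TAGS)) for _ in range(n_per_tag)]
    rng.shuffle(data)
    return data


def scores(W, b, x):
    """Linear class scores: one dot product plus bias per class."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def train_probe(train, n_classes, epochs=100, lr=0.5):
    """Train a linear softmax probe with SGD; the encoder is never updated."""
    dim = len(train[0][0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in train:
            probs = softmax(scores(W, b, x))
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                b[c] -= lr * grad
                for i in range(dim):
                    W[c][i] -= lr * grad * x[i]
    return W, b


def evaluate(W, b, test):
    """Accuracy of the probe on held-out examples."""
    correct = 0
    for x, y in test:
        z = scores(W, b, x)
        correct += int(max(range(len(z)), key=z.__getitem__) == y)
    return correct / len(test)


def run_probe_demo(seed=0):
    rng = random.Random(seed)
    train = make_dataset(30, rng)
    test = make_dataset(10, rng)
    W, b = train_probe(train, len(TAGS))
    return evaluate(W, b, test)


if __name__ == "__main__":
    print(f"probe accuracy on held-out toy data: {run_probe_demo():.2f}")
```

In this sketch the held-out accuracy plays the role of the probe result that approximates how well the representations encode the phenomenon. With a real LM, `frozen_encode` would be replaced by a forward pass through the pretrained model with its weights left untouched.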