I am a Strange Dataset: Metalinguistic Tests for Language Models

Tristan Thrush; Jared Moore; Miguel Monares; Christopher Potts; Douwe Kiela

TL;DR

This work introduces I am a Strange Dataset to probe metalinguistic self-reference in language models via generation and verification tasks, complemented by non-self-referential controls. Across 208 examples (plus auxiliary sets) and multiple model families, most LLMs perform near chance, with GPT-4 showing the only consistent but modest above-chance gains (~60%), while humans reach ~89–93%. The study formalizes generation and validation metrics, including loss-based and prompting-based schemes, and highlights that metalinguistic reasoning remains a hard bottleneck despite model scale. The dataset is openly released with encryption to curb data leakage, and the authors emphasize the need for broader metalinguistic evaluation as a lens on model intelligence and reasoning. Overall, the work underscores the limits of current LLMs in metalinguistic self-reference and motivates targeted improvements and robust evaluation practices.

Abstract

Statements involving metalinguistic self-reference ("This paper has six sections.") are prevalent in many domains. Can current large language models (LLMs) handle such language? In this paper, we present "I am a Strange Dataset", a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like "The penultimate word in this sentence is" (where a correct continuation is "is"). In verification, models judge the truth of statements like "The penultimate word in this sentence is sentence." (false). We also provide minimally different metalinguistic non-self-reference examples to complement the main dataset by probing for whether models can handle metalinguistic language at all. The dataset is hand-crafted by experts and validated by non-expert annotators. We test a variety of open-source LLMs (7B to 70B parameters) as well as closed-source LLMs through APIs. All models perform close to chance across both subtasks and even on the non-self-referential metalinguistic control data, though we find some steady improvement with model scale. GPT 4 is the only model to consistently do significantly better than chance, and it is still only in the 60% range, while our untrained human annotators score well in the 89-93% range. The dataset and evaluation toolkit are available at https://github.com/TristanThrush/i-am-a-strange-dataset.
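To make the two subtasks concrete, the abstract's own "penultimate word" example can be checked mechanically. The snippet below is a minimal sketch and is not taken from the authors' evaluation toolkit: the helper names `words` and `penultimate_claim_holds` are hypothetical, and the tokenization (whitespace splitting with surrounding punctuation stripped) is an assumption made only for this illustration.

```python
import string


def words(sentence: str) -> list[str]:
    """Split on whitespace and strip surrounding punctuation (assumed tokenization)."""
    tokens = (tok.strip(string.punctuation) for tok in sentence.split())
    return [tok for tok in tokens if tok]


def penultimate_claim_holds(beginning: str, ending: str) -> bool:
    """Hypothetical checker: is the claimed word really the penultimate word
    of the full sentence (beginning plus ending)?"""
    full = f"{beginning} {ending}"
    claimed = words(ending)[-1]
    return words(full)[-2] == claimed


beginning = "The penultimate word in this sentence is"

# Ending from the generation example: the statement describes itself correctly.
print(penultimate_claim_holds(beginning, "is."))        # True

# Ending from the verification example: the statement is false.
print(penultimate_claim_holds(beginning, "sentence."))  # False

# The false ending would be correct if the claim referred only to the beginning.
print(words(beginning)[-2])                             # sentence
```

The last line shows why the example is hard: "sentence" is the penultimate word of the beginning alone, so a model that ignores the effect of its own continuation is drawn to the false ending.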

Paper Structure (28 sections, 4 equations, 10 figures, 14 tables)

Introduction
Related Work
I am a Strange Dataset
Dataset
Tags
Metrics
Generation
Validation
Non-Self-Referential Control
Human Experiment Details
Results
Conclusion
Dataset Release Strategy
Limitations
Ethical Considerations
...and 13 more sections

Figures (10)

Figure 1: An example highlighting the challenge presented by our task. All models that we tested on our dataset are close to chance-level.
Figure 2: Examples from the dataset. Each example is comprised of a beginning and two different endings. One of the endings makes the statement true, but it would make the statement false if it referred only to the beginning. The other ending makes the statement false, but it would make the statement true if it referred only to the beginning. True endings are on the left and shown in blue. False endings are on the right and shown in red. In the case of the code example, the true continuation is shown above the false one.
Figure 3: GPT 4 misses that the last words are not repeated.
Figure 4: An example of GPT 4 getting a non-self-referential version of the problem from Figure \ref{fig:gpt4} wrong.
Figure 5: Arguably, an example where GPT 4 should not have gotten points. This is an example where GPT 4 chooses the correct true/false response, but with incorrect reasoning. The "1" symbol appears twice.
...and 5 more figures
