Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance

Timothy Pistotti, Jason Brown, Michael Witbrock

TL;DR

The paper investigates how stimulus design influences findings in LLM-based tests of syntactic knowledge that bear on the Argument from the Poverty of the Stimulus (APS). It identifies key confounds in parasitic gap (PG) stimuli, introduces a pipeline that uses a state-of-the-art (SOTA) generator to produce refined, unambiguous stimuli, and evaluates GPT-2 using surprisal-based metrics: the surprisal $S(w_i \mid C) = -\log_2 P(w_i \mid C)$, the surprisal difference $\Delta_{+\text{filler}}$, and the Difference-in-Differences (DiD). Across the original, filtered, and refined datasets, GPT-2 performs notably better on the refined stimuli, suggesting that previously observed failures may reflect stimulus noise rather than a true lack of syntactic competence. The work emphasizes stimulus quality as a central factor in APS debates and proposes a practical approach for more reliable syntactic evaluation of LLMs.
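To make the surprisal-based metrics concrete, the sketch below computes $S(w_i \mid C)$ and then derives a $\Delta$ and a DiD from four conditions. It rests on assumptions not stated in this summary: a 2x2 design crossing filler and gap, $\Delta$ defined as the surprisal without a gap minus the surprisal with a gap, and DiD defined as $\Delta_{+\text{filler}} - \Delta_{-\text{filler}}$. The probabilities are hypothetical values chosen for easy arithmetic, not results from the paper, and all function names are illustrative.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits: S(w | C) = -log2 P(w | C)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def delta(s_no_gap: float, s_gap: float) -> float:
    """Assumed definition: surprisal without a gap minus surprisal with a gap."""
    return s_no_gap - s_gap


# Hypothetical probabilities of the critical word in each condition
# (illustrative only, not taken from the paper).
p = {
    ("+filler", "+gap"): 0.25,
    ("+filler", "-gap"): 0.0625,
    ("-filler", "+gap"): 0.03125,
    ("-filler", "-gap"): 0.125,
}
s = {cond: surprisal(prob) for cond, prob in p.items()}

# With a filler, a gap should be less surprising than no gap (delta > 0).
delta_plus_filler = delta(s[("+filler", "-gap")], s[("+filler", "+gap")])
# Without a filler, the preference should weaken or reverse.
delta_minus_filler = delta(s[("-filler", "-gap")], s[("-filler", "+gap")])

# Assumed definition of the Difference-in-Differences.
did = delta_plus_filler - delta_minus_filler

print(f"delta(+filler) = {delta_plus_filler:.2f} bits")
print(f"delta(-filler) = {delta_minus_filler:.2f} bits")
print(f"DiD            = {did:.2f} bits")
```

Under these assumed definitions, an item counts as correct on a criterion when the corresponding quantity is positive. In practice the probabilities would come from the language model's next-token distribution rather than being entered by hand.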

Abstract

Recent studies employing Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in recent studies, including lexical ambiguities and structural complexities, may confound model performance. A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. This involves: 1) establishing a baseline on previously used (both filtered and unfiltered) stimuli, and 2) generating a new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5 Pro Preview) guided by linguistically informed templates designed to mitigate identified confounds. Our preliminary findings indicate that GPT-2 demonstrates notably improved performance on these refined parasitic gap (PG) stimuli compared to baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competency.

Paper Structure

This paper contains 14 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: Comparison of GPT-2 accuracy on Parasitic Gap constructions. Accuracy is shown for the $\Delta_{+\text{filler}} > 0$ and Difference-in-Differences (DiD) $> 0$ criteria across the original dataset of Lan et al. (2024), a filtered version, and our own refined stimuli. Error bars represent 95% confidence intervals.
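
The accuracy values and error bars described in the caption can be illustrated with a short sketch. It assumes accuracy is the fraction of items whose score (for example DiD) is greater than zero, and it uses a Wilson score interval for the 95% confidence interval. The caption does not say which interval method the authors used, so the Wilson interval is a stand-in, and the scores below are hypothetical rather than data from the paper.

```python
import math


def accuracy_above_zero(scores: list[float]) -> float:
    """Fraction of items whose score is strictly greater than zero."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(1 for x in scores if x > 0) / len(scores)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical per-item DiD scores in bits (illustrative only).
did_scores = [1.2, -0.3, 0.8, 2.1, 0.0, 0.5, -1.1, 0.9]

acc = accuracy_above_zero(did_scores)
n_correct = sum(1 for x in did_scores if x > 0)
ci_low, ci_high = wilson_interval(n_correct, len(did_scores))

print(f"accuracy = {acc:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

The Wilson interval is chosen here because it stays within [0, 1] and behaves reasonably for small item counts. A normal-approximation or bootstrap interval would be equally plausible readings of the caption.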