Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments

Timothy Pistotti, Jason Brown, Michael Witbrock

TL;DR

This work asks how best to diagnose LLM syntactic knowledge in the debate over the Argument from the Poverty of the Stimulus (APS), contrasting direct minimal-pair testing with Difference-in-Differences (DiD) metrics. By constructing a full 8-permutation parasitic-gap (PG) stimulus set and applying a Wilcox-style wh-effect analysis to GPT-2, the study finds robust filler-gap licensing across licensing and violation contexts, challenging prior DiD-based conclusions. The results indicate that metric design and stimulus quality strongly shape interpretations of LLM competence in complex syntax. The direct minimal-pair framework offers more interpretable diagnostics and should be extended to other models and PG phenomena to refine our understanding of LLM syntactic generalization and the APS debate.

Abstract

Recent studies probing the Argument from the Poverty of the Stimulus (APS) have applied Large Language Models (LLMs) to test the learnability of complex syntax through surprisal-based metrics. However, divergent conclusions raise questions concerning the insights these metrics offer. While Wilcox et al. (2024) used direct minimal pair comparisons (the "wh-effect") to demonstrate that models successfully generalise knowledge of filler-gap dependencies, Lan et al. (2024) used a Difference-in-Differences (DiD) metric and found that models largely fail on parasitic gaps (PGs). This paper argues that the direct minimal pair approach offers greater diagnostic transparency. We demonstrate this by generating a full 8-permutation paradigm of refined PG stimuli and evaluating the GPT-2 model used in previous studies with a systematic Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across all four tested conditions, indicating robust knowledge of filler-gap licensing principles even in complex PG environments. This finding, which contrasts with the more ambiguous results from DiD-style metrics, suggests that the choice of evaluation metric is critical for assessing an LLM's syntactic competence.
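
The contrast between the two metric families can be illustrated with a minimal sketch. The exact formulas used by Wilcox et al. (2024) and Lan et al. (2024) are not given here, so the definitions below are assumptions: the wh-effect is taken to be the surprisal difference between the filler and no-filler member of a minimal pair, and the DiD score is taken to be the interaction of filler and gap. The class name, function names, and surprisal values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Surprisals:
    """Critical-region surprisal (bits) for one item; values are hypothetical."""
    filler_gap: float      # +filler, +gap
    nofiller_gap: float    # -filler, +gap
    filler_nogap: float    # +filler, -gap
    nofiller_nogap: float  # -filler, -gap


def wh_effect_gap(s: Surprisals) -> float:
    # Assumed definition: a filler should make a gap LESS surprising,
    # so the expected sign is negative.
    return s.filler_gap - s.nofiller_gap


def wh_effect_nogap(s: Surprisals) -> float:
    # Assumed definition: a filler should make a filled position MORE
    # surprising, so the expected sign is positive.
    return s.filler_nogap - s.nofiller_nogap


def did_score(s: Surprisals) -> float:
    # Assumed interaction form: one number combining all four conditions.
    return (s.filler_nogap - s.filler_gap) - (s.nofiller_nogap - s.nofiller_gap)


# Made-up surprisal values, for illustration only.
item = Surprisals(filler_gap=4.0, nofiller_gap=7.0,
                  filler_nogap=9.0, nofiller_nogap=5.0)

print(wh_effect_gap(item), wh_effect_nogap(item), did_score(item))
```

Under these assumed definitions the DiD score equals `wh_effect_nogap - wh_effect_gap`, so a single DiD value cannot show which of the two direct comparisons drives the result. Reporting the two wh-effects separately keeps that information visible.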

Paper Structure

This paper contains 11 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Results of Lan et al. metrics on our dataset
  • Figure 2: Mean wh-effects for the four gap configurations. Error bars represent 95% confidence intervals. All effects are in the predicted direction and statistically significant.