Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning

Wesley Scivetti, Tatsuya Aoyama, Ethan Wilcox, Nathan Schneider

TL;DR

This work probes whether human-scale language models can learn both the form and the meaning of the rare let-alone construction. Using a templated minimal-pair benchmark scored with a SLOR-based metric, the authors train two OPT-based models on BabyLM 100M data and test the construction's formal syntactic constraints and its scalar semantics. The models learn the formal constraints robustly, even with very limited direct exposure, but show no robust semantic generalization for let-alone, despite prominent semantic expectations in humans and LLM prompts. Filtering the pretraining data to remove related constructions shows that form knowledge persists through indirect evidence, whereas removing all 'let' or 'alone' tokens sharply degrades performance. Together, the results point to an asymmetry between form and meaning that is not present in human learners, suggesting that current architectures learn form far more sample-efficiently than meaning. The authors call for more attention to semantic learning of rare constructions in smaller-scale LMs and for evaluation on more constructions and languages.

Abstract

Humans have a remarkable ability to acquire and understand grammatical phenomena that are seen rarely, if ever, during childhood. Recent evidence suggests that language models with human-scale pretraining data may possess a similar ability by generalizing from frequent to rare constructions. However, it remains an open question how widespread this generalization ability is, and to what extent this knowledge extends to meanings of rare constructions, as opposed to just their forms. We fill this gap by testing human-scale transformer language models on their knowledge of both the form and meaning of the (rare and quirky) English LET-ALONE construction. To evaluate our LMs we construct a bespoke synthetic benchmark that targets syntactic and semantic properties of the construction. We find that human-scale LMs are sensitive to form, even when related constructions are filtered from the dataset. However, human-scale LMs do not make correct generalizations about LET-ALONE's meaning. These results point to an asymmetry in the current architectures' sample efficiency between language form and meaning, something which is not present in human language learners.

Paper Structure

This paper contains 22 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Results on Syntactic Tests: (a) shows $\Delta\mathcal{S}$, where higher delta values indicate a greater effect of the constraint. Patterns are consistent with the grammaticality of the syntactic manipulation. (b) shows $\Delta\mathcal{S}$(LetAlone) $-$ $\Delta\mathcal{S}$(And).
  • Figure 2: Results on Semantic Tests: (a) shows $\Delta\mathcal{S}$, where higher delta values indicate a greater effect of the constraint. Patterns are consistent with the grammaticality of the syntactic manipulation. (b) shows $\Delta\mathcal{S}$(LetAlone) $-$ $\Delta\mathcal{S}$(And).
  • Figure 3: Top 10 Accuracies on the Semantic Tests when separated by predicate, noun, and comparative that fill the template. Error bars indicate 95% confidence intervals over the results of two random seeds. Above-chance accuracies indicate that the model has some nontrivial semantic performance on that template.
  • Figure 4: Filtered Pretraining Results. Accuracies are calculated according to Equation 3. Error bars are 95% confidence intervals over the mean accuracies across two randomly seeded runs. Table \ref{tab:filt_res} presents the same data.
  • Figure 5: $\Delta$SLOR(Let-Alone) $-$ $\Delta$SLOR(And) for Filtered Pretraining Conditions. Positive differences indicate sensitivity to the construction. Differences generally remain positive after removing direct evidence from related constructions (NoRel, NoAny), but become negative when all 'let' or 'alone' tokens are removed.
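
The captions above report $\Delta\mathcal{S}$ and the difference $\Delta\mathcal{S}$(LetAlone) $-$ $\Delta\mathcal{S}$(And). The following is a minimal sketch of how such quantities could be computed, not the authors' implementation. It assumes the commonly used SLOR definition (sentence log-probability under the LM minus its unigram log-probability, divided by sentence length) and assumes that $\Delta\mathcal{S}$ is the score of the acceptable sentence minus the score of its manipulated counterpart; the paper's exact definitions are not reproduced in this summary. All function names and the toy log-probabilities are hypothetical.

```python
from typing import Sequence


def slor(lm_logprobs: Sequence[float], unigram_logprobs: Sequence[float]) -> float:
    """SLOR = (log p_LM(s) - log p_unigram(s)) / |s|.

    Both arguments hold one log-probability per token of the same sentence.
    """
    if len(lm_logprobs) != len(unigram_logprobs):
        raise ValueError("token counts must match")
    if not lm_logprobs:
        raise ValueError("sentence must contain at least one token")
    return (sum(lm_logprobs) - sum(unigram_logprobs)) / len(lm_logprobs)


def delta_s(acceptable_score: float, manipulated_score: float) -> float:
    """Assumed Delta S: score gap within one minimal pair (larger = stronger effect)."""
    return acceptable_score - manipulated_score


def construction_effect(delta_let_alone: float, delta_and: float) -> float:
    """Delta S (LetAlone) - Delta S (And): positive values suggest the effect
    is specific to let-alone rather than shared with the 'and' control."""
    return delta_let_alone - delta_and


# Toy, made-up per-token log-probabilities for one let-alone minimal pair.
good = slor([-1.0, -2.0], [-3.0, -4.0])   # acceptable variant
bad = slor([-2.5, -3.0], [-3.0, -4.0])    # manipulated variant
effect = construction_effect(delta_s(good, bad), 0.25)  # 0.25: hypothetical 'and' delta
```

Subtracting the 'and' control is what lets a positive value be read as sensitivity to the construction itself, rather than a general dispreference for the manipulated word string.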