Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models

Najoung Kim, Tal Linzen, Paul Smolensky

TL;DR

The paper examines how pretraining data can break the distributional control that compositional generalization benchmarks rely on, because pretraining may expose the model to the context-controlled lexical items outside their restricted contexts. It proposes two lexical-control modifications (replacing those items with novel character sequences, or with special tokens represented by novel embeddings), introduces a Test-Lex lexical-difficulty test, and evaluates T5-base on COGS under these setups. Generalization degrades substantially under both modifications (64–69% with character sequences; 6–32% with novel embeddings; up to roughly 51 percentage points of overestimation), and the degradation grows with the amount of pretraining data, a case of inverse scaling. The findings suggest that prior claims about pretrained models' compositional generalization may be overstated, and that evaluation designs should align with downstream needs and separate lexical factors from structural generalization. Overall, the work calls for more carefully controlled benchmarking of compositional generalization and motivates further study of structural generalization under controlled data conditions.

Abstract

Human linguistic capacity is often characterized by compositionality and the generalization it enables -- human learners can produce and comprehend novel complex expressions by composing known parts. Several benchmarks exploit distributional control across training and test to gauge compositional generalization, where certain lexical items only occur in limited contexts during training. While recent work using these benchmarks suggests that pretrained models achieve impressive generalization performance, we argue that exposure to pretraining data may break the aforementioned distributional control. Using the COGS benchmark of Kim and Linzen (2020), we test two modified evaluation setups that control for this issue: (1) substituting context-controlled lexical items with novel character sequences, and (2) substituting them with special tokens represented by novel embeddings. We find that both of these setups lead to lower generalization performance in T5 (Raffel et al., 2020), suggesting that previously reported results have been overestimated due to uncontrolled lexical exposure during pretraining. The performance degradation is more extreme with novel embeddings, and the degradation increases with the amount of pretraining data, highlighting an interesting case of inverse scaling.
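
As a rough illustration of setup (1), the sketch below replaces a context-controlled word with a random character sequence in both the input sentence and its logical form. The helper names (`make_novel_word`, `build_substitution_map`, `substitute`) and the example pair are hypothetical and are not taken from the paper or from the COGS data files; the authors' actual procedure for choosing novel strings may differ.

```python
import random
import re
import string


def make_novel_word(rng, length=5):
    """Return a random lowercase string to stand in for a real word."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def build_substitution_map(controlled_items, seed=0):
    """Map every context-controlled item to a distinct novel character sequence."""
    rng = random.Random(seed)
    mapping = {}
    used = set(controlled_items)  # never map an item onto itself or another item
    for item in controlled_items:
        novel = make_novel_word(rng)
        while novel in used:
            novel = make_novel_word(rng)
        used.add(novel)
        mapping[item] = novel
    return mapping


def substitute(text, mapping):
    """Replace whole-word occurrences of each mapped item in `text`."""
    if not mapping:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


# Illustrative (sentence, logical form) pair, loosely in the style of COGS.
# This is an assumption for demonstration, not an example copied from the dataset.
source = "A hedgehog ate the cake ."
target = "hedgehog ( x _ 1 ) AND eat . agent ( x _ 2 , x _ 1 )"

mapping = build_substitution_map(["hedgehog"])
new_source = substitute(source, mapping)
new_target = substitute(target, mapping)
```

The same mapping has to be applied consistently to inputs and outputs across the training and generalization sets so that the benchmark's structure is preserved. A random string could still coincide with a word seen in pretraining; checking candidates against a vocabulary is left out of this sketch. Setup (2) would instead map each item to a special token with a freshly initialized embedding, which is not shown here.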

Paper Structure (23 sections, 1 figure, 4 tables)

Figures (1)

  • Figure 1: Highly variable generalization performance of T5-base under the different modifications proposed in this paper. The best reported performance using T5-base, from Orhan (2021), is marked with a red dotted line. Overestimation refers to the difference between this red dotted line and the blue bars.
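
The overestimation described in the Figure 1 caption is a simple difference, sketched below. The function name `overestimation_pp` and all accuracy values are placeholders chosen for illustration; they are not results from the paper.

```python
def overestimation_pp(reported_accuracy, controlled_accuracy):
    """Gap, in percentage points, between a previously reported accuracy
    and the accuracy under a lexically controlled setup (both on a 0-100 scale)."""
    return reported_accuracy - controlled_accuracy


# Placeholder values for illustration only, NOT figures from the paper.
reported = 80.0  # stands in for the red dotted line
controlled = {"novel_chars": 50.0, "novel_embeddings": 20.0}  # stand in for the blue bars

gaps = {name: overestimation_pp(reported, acc) for name, acc in controlled.items()}
```

A larger gap means the previously reported number overstated generalization by more under that setup.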