ReCOGS: How Incidental Details of a Logical Form Overshadow an Evaluation of Semantic Interpretation

Zhengxuan Wu; Christopher D. Manning; Christopher Potts

TL;DR

The COGS benchmark assesses compositional generalization by mapping English sentences to logical forms (LFs), but its results are confounded by incidental LF details. The authors show that meaning-preserving LF changes (removing redundant tokens, controlling for length artifacts, and altering variable-name bindings) strongly affect performance, which suggests that prior failures reflect representation choices rather than an inability to interpret meaning. They propose ReCOGS, a revised benchmark with SEM-based evaluation, index randomization, and targeted data augmentations (e.g., preposing, filler words, participial verb phrases) to better isolate semantic generalization. Across LSTM and Transformer baselines, ReCOGS remains challenging but lets models gain some traction, underscoring the need for robust benchmark design and careful interpretation of compositionality results in semantic parsing.

Abstract

Compositional generalization benchmarks for semantic parsing seek to assess whether models can accurately compute meanings for novel sentences, but operationalize this in terms of logical form (LF) prediction. This raises the concern that semantically irrelevant details of the chosen LFs could shape model performance. We argue that this concern is realized for the COGS benchmark. COGS poses generalization splits that appear impossible for present-day models, which could be taken as an indictment of those models. However, we show that the negative results trace to incidental features of COGS LFs. Converting these LFs to semantically equivalent ones and factoring out capabilities unrelated to semantic interpretation, we find that even baseline models get traction. A recent variable-free translation of COGS LFs suggests similar conclusions, but we observe this format is not semantically equivalent; it is incapable of accurately representing some COGS meanings. These findings inform our proposal for ReCOGS, a modified version of COGS that comes closer to assessing the target semantic capabilities while remaining very challenging. Overall, our results reaffirm the importance of compositional generalization and careful benchmark task design.
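To make "semantically equivalent LFs" concrete, here is a minimal sketch of one such change: consistently renaming variable indices, in the spirit of the index randomization mentioned in the TL;DR. The `x_<n>` string format, the function names, and the `max_index` parameter are assumptions made for illustration; they are not the paper's exact LF syntax, augmentation code, or evaluation metric.

```python
import random
import re

VAR = re.compile(r"x_(\d+)")  # toy variable syntax, assumed for this sketch


def randomize_indices(lf: str, rng: random.Random, max_index: int = 100) -> str:
    """Consistently rename every variable x_<n> to a fresh random index."""
    old = sorted({int(n) for n in VAR.findall(lf)})
    new = rng.sample(range(max_index), len(old))  # distinct new indices
    mapping = dict(zip(old, new))
    return VAR.sub(lambda m: f"x_{mapping[int(m.group(1))]}", lf)


def canonicalize(lf: str) -> str:
    """Rename variables by order of first appearance.

    Two LFs that differ only in variable names (with conjuncts in the
    same order) map to the same canonical string.
    """
    mapping = {}

    def rename(m):
        mapping.setdefault(m.group(1), str(len(mapping)))
        return f"x_{mapping[m.group(1)]}"

    return VAR.sub(rename, lf)


lf = "cat ( x_1 ) ; run . agent ( x_2 , x_1 )"
renamed = randomize_indices(lf, random.Random(0))
print(renamed)
print(canonicalize(renamed) == canonicalize(lf))
```

A string-level exact match would treat `lf` and `renamed` as different outputs even though they describe the same meaning. The canonicalization here is a simplification: it ignores reordering of conjuncts, which a fully semantic comparison would also need to handle.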

Paper Structure (24 sections, 5 figures, 8 tables)

This paper contains 24 sections, 5 figures, 8 tables.

Introduction
Background: COGS Benchmark
Related Work
Approaches to COGS
COGS Artifacts
A Flawed Variable-Free COGS Representation
Experiments
Methods
Architectures
Training Details
No Pretraining
Experiment 1: Removing Redundant Tokens from LFs
LF Modifications
Results
Experiment 2: Separating Structural and Length Generalization
...and 9 more sections

Figures (5)

Figure 1: Converting COGS LFs into semantically equivalent LFs greatly impacts model performance: removing redundant tokens increases performance on the lexical (LEX) tasks, while length augmentation and meaning-preserving syntactic transformations help on the harder structural (STRUCT) tasks. ReCOGS incorporates these lessons while also decoupling variable names from linear position. The result is a more purely semantic task that remains extremely challenging for present-day models.
Figure 2: The frequencies of bigrams in the training data starting with "," become more balanced after removing two incidental tokens {x_}.
Figure 3: Sequence length distributions for the COGS training split and the generalization splits. The generalization split has inputs and logical forms with lengths completely unseen in the training set.
Figure 4: Adding $k$ items with concatenated training examples to give exposure to long sequences greatly improves structural generalization on COGS for both LSTM-based and transformer-based models. (Transformer-based) SoTA performance is taken from zheng2020compositional. The plots show means (of 20 runs) with 95% confidence intervals.
Figure 5: Model performance over different testing splits in COGS, ReCOGS$_{\text{POS}}$ (original variable name bindings are kept), and ReCOGS. We report means (of 20 runs) with bootstrapped 95% confidence intervals.
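
The effect described in the Figure 2 caption can be reproduced in miniature. The sketch below counts which tokens follow "," in a few toy LFs, before and after dropping two tokens. It assumes that variables are tokenized as `x _ <n>` and that the two incidental tokens are `x` and `_`; the example LFs and the function names are hypothetical and not taken from the COGS data.

```python
from collections import Counter


def next_token_counts(tokens, first):
    """Count the tokens that immediately follow `first`."""
    return Counter(b for a, b in zip(tokens, tokens[1:]) if a == first)


def comma_bigram_counts(lfs, drop=frozenset()):
    """Aggregate counts of bigrams starting with ',' over whitespace-tokenized LFs."""
    total = Counter()
    for lf in lfs:
        tokens = [t for t in lf.split() if t not in drop]
        total += next_token_counts(tokens, ",")
    return total


# Toy LFs in an assumed "x _ <n>" tokenization (illustrative only).
toy_lfs = [
    "run . agent ( x _ 2 , x _ 1 )",
    "see . theme ( x _ 1 , x _ 3 )",
    "eat . agent ( x _ 4 , x _ 0 )",
]

before = comma_bigram_counts(toy_lfs)
after = comma_bigram_counts(toy_lfs, drop={"x", "_"})
print(before)
print(after)
```

With the original tokens, every "," is followed by the same token, so the bigram carries no information about the argument. Once the two tokens are dropped, the token after "," is the index itself and the counts spread over several values.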
