Learning to Substitute Components for Compositional Generalization

Zhaoyi Li, Gangwei Jiang, Chenwang Wu, Ying Wei, Defu Lian, Enhong Chen

TL;DR

This work tackles the limited compositional generalization of neural language models by introducing CompSub, a span-based compositional data augmentation that enables multi-grained substitutions across training data. Building on CompSub, the authors present Learning Component Substitution (LCS), a differentiable augmenter that learns substitution probabilities by maximizing downstream loss, thereby prioritizing challenging and novel compositions; they further extend these ideas to in-context learning with LCS-ICL for state-of-the-art LLMs. Theoretical analyses show CompSub acts as an implicit regularizer that promotes semantic invariance and reduces Rademacher complexity, while empirical results across SCAN, COGS, GeoQuery, and COGS-QL demonstrate substantial gains (up to 66.5% on SCAN and 10.3% on COGS, with additional improvements for LCS and LCS-ICL). Overall, the approach provides a principled, end-to-end, and model-agnostic framework to inject multi-grained compositional bias and improve few-shot and in-context generalization in language tasks.

Abstract

Despite the rising prevalence of neural language models, recent empirical evidence suggests their deficiency in compositional generalization. One of the current de facto solutions to this problem is compositional data augmentation, which aims to introduce additional compositional inductive bias. However, existing handcrafted augmentation strategies offer limited improvement when systematic generalization of neural language models requires multi-grained compositional bias (i.e., not limited to either lexical or structural biases alone) or when training sentences have an imbalanced difficulty distribution. To address these challenges, we first propose a novel compositional augmentation strategy called Component Substitution (CompSub), which enables multi-grained composition of substantial substructures across the entire training set. Furthermore, we introduce the Learning Component Substitution (LCS) framework. This framework empowers the learning of component substitution probabilities in CompSub in an end-to-end manner by maximizing the loss of neural language models, thereby prioritizing challenging compositions with elusive concepts and novel contexts. We extend the key ideas of CompSub and LCS to the recently emerging in-context learning scenarios of pre-trained large language models (LLMs), proposing the LCS-ICL algorithm to enhance the few-shot compositional generalization of state-of-the-art (SOTA) LLMs. Theoretically, we provide insights into why applying our algorithms to language models can improve compositional generalization performance. Empirically, our results on four standard compositional generalization benchmarks (SCAN, COGS, GeoQuery, and COGS-QL) demonstrate the superiority of CompSub, LCS, and LCS-ICL, with improvements of up to 66.5%, 10.3%, 1.4%, and 8.8%, respectively.
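
The adversarial training idea in the abstract (the augmenter is trained to maximize the loss that the language model is trained to minimize) can be written as a min-max problem. The following is only a sketch under assumed notation, not the paper's exact objective, which may include additional terms: $\theta$ denotes the language model's parameters, $\phi$ the augmenter's parameters, $q_\phi(\cdot \mid x, y)$ the distribution over augmented examples induced by the learned substitution probabilities, and $\mathcal{L}_\theta$ the model's training loss.

$$
\min_{\theta}\;\max_{\phi}\;\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\;\mathbb{E}_{(x_{gen},\,y_{gen})\sim q_\phi(\cdot\mid x,y)}\big[\mathcal{L}_\theta(x_{gen}, y_{gen})\big]
$$

Under this reading, raising the probability of substitutions that produce high-loss examples is what steers augmentation toward the harder compositions.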

Paper Structure

This paper contains 32 sections, 7 theorems, 15 equations, 9 figures, 11 tables, 3 algorithms.

Key Result

Theorem 1

Let $h$ denote the negative log-likelihood loss function: $h(p) = -\log(p)$. We have the following inequality: $\mathbb{E}_{x_s,y_s}[h(p_\theta(y_*|y_s,x_s\oplus x_*)) + 2\sum_{g\in\mathcal{G}} \lVert p_\theta(y_*|g \circ y_s, (g\circ x_s)\oplus x_*)-p_\theta(y_*|y_s,x_s\oplus x_*)\rVert_2^2] \leq \mathbb

Figures (9)

  • Figure 1: (a), (b) and (c) illustrate three distinct compositional generalization types in COGS, which require word-level, subtree-level and general substructure-level recombinations of training data, respectively. In addition, (d) shows concepts of differing difficulty in the SCAN dataset, where the interpretation of "walk around right" is much more complex than that of the other two concepts.
  • Figure 2: An augmentation example by CompSub. CompSub substitutes the span "largest" with another span, "largest city in the smallest", yielding the new question "What is the population of the largest city in the smallest state?".
  • Figure 3: Examples of non-eligible and eligible spans in COGS. (a) shows a non-eligible span, which corresponds to a union of disconnected fragments of the tree.
  • Figure 4: Illustration of the LCS training framework. The LCS training framework contains an upstream LCS augmentor and a downstream neural seq-to-seq model. Given an original training example $(x,y)$, the upstream LCS augmentor (parameters: $\phi$) predicts the probability distribution over the spans in $(x,y)$ to be substituted out and the probability distribution over the spans in the training set to be substituted in. Sampling the spans to be substituted out and substituted in from these distributions, we augment the original training example to generate $(x_{gen},y_{gen})$ and send it to the downstream neural seq-to-seq model (parameters: $\theta$). In the parameter-update phase, we iteratively update $\phi$ by maximizing the loss of the downstream model and update $\theta$ by minimizing the loss of the downstream model.
  • Figure 5: The figure illustrates the workflow of the LCS-ICL algorithm when constructing a $k$-shot ICL-style prompt. The workflow contains three main stages. (1) Coarse Screening Stage: Select $m\approx\lceil k/2\rceil$ examples $\{(x_i,y_i)\}_{i=1}^{m}$ from $\mathcal{D}$ so that as many primitive concepts in the query $x_q$ as possible are covered by $\{x_i\}_{i=1}^{m}$. (2) Demonstration Augmentation Stage: Introduce additional compositional inductive bias by running CompSub on $\{(x_i,y_i)\}_{i=1}^{m}$ to get an augmented demonstration pool $\mathcal{D}^*=\text{CompSub}(\{(x_i,y_i)\}_{i=1}^{m})$. (3) Fine Screening Stage: Successively retrieve the remaining $n = k-m$ demonstrations from $\mathcal{D}^*$, each time choosing the candidate demonstration that the model finds most difficult (the one with the highest perplexity score) when learning in context from the currently selected demonstrations.
  • ...and 4 more figures
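
The substitution described in the Figure 2 caption can be illustrated on the input side with a few lines of Python. This is a minimal sketch, not the authors' implementation: the helper name `substitute_span` is hypothetical, the original question is inferred from the caption by reversing the substitution, tokens are split on whitespace, and the aligned substitution on the output (logical form) side is omitted, as are the span-eligibility checks that Figure 3 refers to.

```python
from typing import List, Sequence


def substitute_span(
    tokens: Sequence[str],
    span_out: Sequence[str],
    span_in: Sequence[str],
) -> List[str]:
    """Replace every contiguous occurrence of span_out in tokens with span_in.

    Hypothetical helper for illustration only; it does no eligibility checking.
    """
    if not span_out:
        raise ValueError("span_out must be non-empty")
    result: List[str] = []
    width = len(span_out)
    i = 0
    while i < len(tokens):
        # Compare a window of the same width as span_out at position i.
        if list(tokens[i:i + width]) == list(span_out):
            result.extend(span_in)
            i += width
        else:
            result.append(tokens[i])
            i += 1
    return result


# Original question inferred from the Figure 2 caption (an assumption).
question = "What is the population of the largest state ?".split()
span_out = ["largest"]
span_in = "largest city in the smallest".split()

augmented = substitute_span(question, span_out, span_in)
augmented_text = " ".join(augmented)
print(augmented_text)
```

In a paired setting the same replacement would presumably be applied to the output sequence with the spans aligned to `span_out` and `span_in`, so that the augmented input and its label stay consistent.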

Theorems & Definitions (12)

  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Definition 1
  • Lemma 1
  • Proof
  • Theorem 3
  • Proof
  • Corollary 2
  • Proof
  • ...and 2 more