Making Transformers Solve Compositional Tasks

Santiago Ontañón, Joshua Ainslie, Vaclav Cvicek, Zachary Fisher

TL;DR

The paper addresses the failure of Transformers to generalize compositionally by systematically exploring how Transformer design choices shape the model's inductive biases. It analyzes position encodings, decoder type, model size, weight sharing, and intermediate representations across 12 diverse datasets, with all models trained from scratch. Key findings include large gains from relative position encodings, copy decoders, and intermediate representations, culminating in state-of-the-art accuracy on COGS (0.784) and strong results on PCFG. These results show that architectural inductive biases matter for compositional generalization, and they offer guidance for future work on scaling and pre-training.

Abstract

Several studies have reported the inability of Transformer models to generalize compositionally, a key type of generalization in many NLP tasks such as semantic parsing. In this paper we explore the design space of Transformer models showing that the inductive biases given to the model by several design decisions significantly impact compositional generalization. Through this exploration, we identified Transformer configurations that generalize compositionally significantly better than previously reported in the literature in a diverse set of compositional tasks, and that achieve state-of-the-art results in a semantic parsing compositional generalization benchmark (COGS), and a string edit operation composition benchmark (PCFG).

Paper Structure

This paper contains 20 sections, 3 figures, 12 tables.

Figures (3)

  • Figure 1: Examples from the different datasets used in our experiments.
  • Figure 2: An illustration of a Transformer, extended with the additional components necessary to explore the different dimensions we experiment with in this paper: (1) position encodings, (2) copy decoder, (3) model size ($l, d, f, h$), (4) weight sharing, and (5) intermediate representations.
  • Figure 3: Examples from the intermediate representations for COGS and CFQ. For COGS, we framed the task as sequence tagging and made the model predict 5 tags for each token; for CFQ we compressed Cartesian products.
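
As a rough illustration of the first dimension listed for Figure 2 (position encodings), the sketch below builds the clipped relative-offset matrix that relative position encodings commonly index into. It is a minimal sketch of one widely used formulation, not necessarily the exact variant evaluated in the paper; the function name `relative_position_ids` and the parameter `max_distance` are hypothetical names chosen for this example.

```python
def relative_position_ids(length, max_distance):
    """Build a length x length matrix of clipped relative offsets.

    Entry [i][j] encodes the offset (j - i) from query position i to key
    position j, clipped to [-max_distance, max_distance] and shifted by
    max_distance so it can index an embedding table of size
    2 * max_distance + 1. (Illustrative assumption, not the paper's code.)
    """
    ids = []
    for i in range(length):
        row = []
        for j in range(length):
            # Clip the signed offset so distant tokens share one bucket.
            offset = max(-max_distance, min(max_distance, j - i))
            # Shift into the non-negative range [0, 2 * max_distance].
            row.append(offset + max_distance)
        ids.append(row)
    return ids


if __name__ == "__main__":
    for row in relative_position_ids(4, 2):
        print(row)
```

Because each entry depends only on the offset between two positions and not on their absolute indices, the same learned embeddings apply wherever a pattern occurs in the sequence.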