Inducing Systematicity in Transformers by Attending to Structurally Quantized Embeddings

Yichen Jiang; Xiang Zhou; Mohit Bansal

TL;DR

This work tackles the challenge of systematic generalization in Transformers, especially under low-complexity training data. It introduces SQ-Transformer, which couples Structure-oriented Vector Quantization (SoVQ) with two attention mechanisms—Systematic Attention Layer (SAL) and Systematically Regularized Layer (SRL)—to induce invariant or soft-invariant processing for sentences sharing the same structure. Empirically, SoVQ groups words by syntactic function, while SAL/SRL foster generalizable attention patterns, yielding improved compositional generalization on SCAN AddJump and AroundRight, and competitive or superior results on COGS, CoGnition, and WMT. The paper provides extensive analyses showing clustered embedding spaces and systematic attention, and discusses when hard invariance (SAL) versus soft invariance (SRL) is advantageous, contributing to understanding Transformer generalization and informing future architecture development.

Abstract

Transformers generalize to novel compositions of structures and entities after being trained on a complex dataset, but easily overfit on datasets of insufficient complexity. We observe that when the training set is sufficiently complex, the model encodes sentences that have a common syntactic structure using a systematic attention pattern. Inspired by this observation, we propose SQ-Transformer (Structurally Quantized) that explicitly encourages systematicity in the embeddings and attention layers, even with a training set of low complexity. At the embedding level, we introduce Structure-oriented Vector Quantization (SoVQ) to cluster word embeddings into several classes of structurally equivalent entities. At the attention level, we devise the Systematic Attention Layer (SAL) and an alternative, Systematically Regularized Layer (SRL) that operate on the quantized word embeddings so that sentences of the same structure are encoded with invariant or similar attention patterns. Empirically, we show that SQ-Transformer achieves stronger compositional generalization than the vanilla Transformer on multiple low-complexity semantic parsing and machine translation datasets. In our analysis, we show that SoVQ indeed learns a syntactically clustered embedding space and SAL/SRL induces generalizable attention patterns, which lead to improved systematicity.
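The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a plain nearest-neighbor codebook lookup as a stand-in for SoVQ, and it assumes that attention weights are computed from the quantized embeddings while values come from the original word embeddings. The names `quantize`, `attention_from_quantized`, the toy codebook, and the projection matrices are all hypothetical.

```python
import numpy as np

def quantize(embeddings, codebook):
    """Map each embedding to its nearest codebook vector (plain VQ lookup)."""
    # Squared Euclidean distance between every embedding and every code.
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=-1)
    return codebook[codes], codes

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_from_quantized(word_emb, quant_emb, w_q, w_k, w_v):
    """Assumed design: queries/keys from quantized embeddings, values from words."""
    q = quant_emb @ w_q
    k = quant_emb @ w_k
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ (word_emb @ w_v), weights

rng = np.random.default_rng(0)
dim = 4
# Toy codebook with three classes (hypothetical: verb, modifier, direction).
codebook = np.eye(3, dim)
# Toy word embeddings: each word sits near the code of its assumed class.
vocab = {
    "walk": codebook[0] + 0.1 * rng.normal(size=dim),
    "jump": codebook[0] + 0.1 * rng.normal(size=dim),
    "around": codebook[1] + 0.1 * rng.normal(size=dim),
    "left": codebook[2] + 0.1 * rng.normal(size=dim),
}
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

def encode(sentence):
    word_emb = np.stack([vocab[w] for w in sentence.split()])
    quant_emb, codes = quantize(word_emb, codebook)
    out, weights = attention_from_quantized(word_emb, quant_emb, w_q, w_k, w_v)
    return out, weights, codes

out_a, weights_a, codes_a = encode("walk around left")
out_b, weights_b, codes_b = encode("jump around left")

# Both sentences map to the same code sequence, so their attention
# weights coincide even though the word-level values differ.
print(codes_a, codes_b)
print(np.allclose(weights_a, weights_b))
```

In this toy setup, "walk" and "jump" fall into the same class, so the two sentences share one attention pattern while the outputs still carry word-specific information through the values. This mirrors the hard-invariance idea attributed to SAL; the soft variant (SRL) would only encourage, rather than force, similar patterns.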

Paper Structure (57 sections, 1 theorem, 18 equations, 7 figures, 6 tables)

Introduction
Background and Motivation
Vector Quantization
Brown Clustering
SQ-Transformer
Notations.
Structure-oriented Vector Quantization
Variational MMI objective
Variational Brown Clustering
Systematic Attention Layer
Systematically Regularized Layer
Experiments
Datasets
SCAN AddJump
SCAN AroundRight
...and 42 more sections

Key Result

Theorem 1

Let $x_a$ and $x_b$ be two tokens that appear only in the same set of contexts $\hat{X}$. Let $p', q'$ denote the optimal solution. Then $q'(z|x_a) = q'(z|x_b)$ for all $z \in Z$; that is, $x_a$ and $x_b$ are clustered into the same class in the optimal solution.

Figures (7)

Figure 1: Attention maps encoding a training example "walk around left" and a test example "jump around left", from Transformers trained on the original SCAN AddJump training set (a and b) and on the 20x augmented training set of Zhou et al. (2023) (c and d), which contains 20 times more primitives such as 'walk1' and more examples such as "walk1 around left". Red boxes mark the attention maps in (b) that differ from (a). When trained on the 20x augmented training set, the model encodes the two examples with highly similar attention maps across all layers and heads (c and d). Attention maps of other training instances of the structure "\$x around left" are shown in Fig. \ref{fig:1x_20x_lookrunwalkjump_attn_maps}.
Figure 2: Architecture of the Systematic Attention Layer (SAL) and the Systematically Regularized Layer (SRL).
Figure 3: t-SNE visualization of embeddings learned on the SCAN AddJump dataset (Lake and Baroni, 2018).
Figure 4: t-SNE visualization of embeddings learned on the SCAN AddJump dataset (Lake and Baroni, 2018) with $n_a$ atomic expressions (e.g., jump$\mapsto$JUMP and walk$\mapsto$WALK) for each primitive, together with the model's accuracy (acc).
Figure 5: Encoder attention maps and their average KL-divergence between the two examples, from the vanilla Transformer trained on SCAN AddJump datasets with different numbers of distinct primitives (Zhou et al., 2023): original, 2x, 20x, and 200x, on which the model achieves 3.67%, 15.78%, 100%, and 100% accuracy on the entire test set, respectively.
...and 2 more figures
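The Figure 5 caption compares attention maps by their average KL-divergence. A minimal sketch of one way to compute such a score is given below. The caption does not specify the direction of the divergence or how it is averaged, so both choices here (KL from the first map to the second, averaged over all query rows, layers, and heads) are assumptions, and `mean_attention_kl` is a hypothetical helper name.

```python
import numpy as np

def mean_attention_kl(attn_a, attn_b, eps=1e-12):
    """Average KL(a || b) over every attention row.

    Both inputs have shape (..., num_queries, num_keys), and each row along
    the last axis is a probability distribution over keys.
    """
    # Clip to avoid log(0) when an attention weight is exactly zero.
    a = np.clip(attn_a, eps, 1.0)
    b = np.clip(attn_b, eps, 1.0)
    kl_per_row = (a * (np.log(a) - np.log(b))).sum(axis=-1)
    # Mean over queries and any leading layer/head dimensions.
    return float(kl_per_row.mean())

# Toy 3x3 attention maps: one uniform, one peaked on the diagonal.
uniform = np.full((3, 3), 1.0 / 3.0)
peaked = np.eye(3) * 0.7 + 0.1  # rows are [0.8, 0.1, 0.1] up to permutation

kl_same = mean_attention_kl(peaked, peaked)
kl_diff = mean_attention_kl(uniform, peaked)
print(kl_same, kl_diff)
```

Identical maps give a score of zero, and the score grows as the two attention patterns diverge, which is the sense in which a low value indicates that two sentences of the same structure are encoded alike.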

Theorems & Definitions (1)

Theorem 1
