SIP: Injecting a Structural Inductive Bias into a Seq2Seq Model by Simulation

Matthias Lindemann, Alexander Koller, Ivan Titov

TL;DR

SIP addresses the challenge that standard seq2seq models lack structural inductive biases, which hampers systematic generalization. By pre-training a Transformer to simulate Finite State Transducers from their descriptions and inputs, SIP injects a reusable inductive bias that improves both systematic generalization and few-shot learning, including transfer to natural tasks like grapheme-to-phoneme conversion and text editing. Probing shows the model internally simulates FST state transitions, and fine-tuning leverages these dynamics to solve unseen or longer-input tasks. The approach is computationally cheap relative to meta-learning and offers a flexible pathway to incorporating other structured biases such as Pushdown Transducers.

Abstract

Strong inductive biases enable learning from little data and help generalization outside of the training distribution. Popular neural architectures such as Transformers lack strong structural inductive biases for seq2seq NLP tasks on their own. Consequently, they struggle with systematic generalization beyond the training distribution, e.g. with extrapolating to longer inputs, even when pre-trained on large amounts of text. We show how a structural inductive bias can be efficiently injected into a seq2seq model by pre-training it to simulate structural transformations on synthetic data. Specifically, we inject an inductive bias towards Finite State Transducers (FSTs) into a Transformer by pre-training it to simulate FSTs given their descriptions. Our experiments show that our method imparts the desired inductive bias, resulting in improved systematic generalization and better few-shot learning for FST-like tasks. Our analysis shows that fine-tuned models accurately capture the state dynamics of the unseen underlying FSTs, suggesting that the simulation process is internalized by the fine-tuned model.
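To make the pre-training task concrete, the following is a minimal sketch of how one simulation instance could be built: sample a small deterministic FST, serialize it as a text prefix, and pair it with an input string and the output the FST produces. The function names (`sample_fst`, `run_fst`, `serialize`, `make_example`), the text encoding, and the choice of complete deterministic FSTs are illustrative assumptions, not the encoding or sampling procedure used in the paper.

```python
import random

def sample_fst(num_states, alphabet, rng):
    """Sample a random deterministic FST (hypothetical sampling scheme).

    Maps (state, input symbol) -> (next state, emitted string).
    The emitted string may be empty, which models deletion.
    """
    outputs = list(alphabet) + [""]
    return {
        (q, a): (rng.randrange(num_states), rng.choice(outputs))
        for q in range(num_states)
        for a in alphabet
    }

def run_fst(transitions, inputs, start=0):
    """Simulate the FST on `inputs` and return the concatenated output."""
    state, emitted = start, []
    for sym in inputs:
        state, out = transitions[(state, sym)]
        emitted.append(out)
    return "".join(emitted)

def serialize(transitions):
    """Flatten the transition table into text (assumed format, not the paper's)."""
    return " | ".join(
        f"{q} {a} {out or '<eps>'} {nxt}"
        for (q, a), (nxt, out) in sorted(transitions.items())
    )

def make_example(rng, num_states=3, alphabet=("a", "b"), length=6):
    """Build one (FST description + input) -> output training pair."""
    fst = sample_fst(num_states, alphabet, rng)
    inputs = "".join(rng.choice(alphabet) for _ in range(length))
    return {"source": f"{serialize(fst)} || {inputs}", "target": run_fst(fst, inputs)}

if __name__ == "__main__":
    print(make_example(random.Random(0)))
```

In this sketch the model would read `source` and be trained to emit `target`, so the only way to do well across many sampled FSTs is to carry out the state transitions described in the prefix.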

Paper Structure

This paper contains 55 sections, 6 equations, 9 figures, 11 tables, and 2 algorithms.

Figures (9)

  • Figure 1: Left: Pre-training a Transformer to simulate automatically generated FSTs. Right: fine-tuning the Transformer and the prefix where the FST used to be on a downstream task by using only input/output pairs. Tunable parameters are represented in orange.
  • Figure 2: Examples of functional FSTs. The FST in (a) deletes leading zeros. The FST in (b) replaces any 0 by a 1 if the last input symbol is a 1. Conversely, if the last symbol is a 2, any 0 is replaced by a 2. The output can only be determined after the last input symbol.
  • Figure 3: Evaluation on deterministic FST tasks with more states than seen in pre-training. We show the deviation in percentage points from ByT5.
  • Figure 4: Left: we train a linear probe on the encoder representations of a SIP pre-trained model to predict for each input token $x_i$ which state the encoded FST is in before processing $x_i$. The end-of-sequence token is represented as <s>. Right: we freeze the trained probe, fine-tune the SIP model on input/output pairs and extract state sequences from it with the probe.
  • Figure 5: Row-normalized confusion matrices on the training and test data between ground truth and the state predicted by the frozen probe applied to fine-tuned models. We average across the 5 iteration generalization tasks (\ref{sec:within-pretrain}).
  • ...and 4 more figures
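
The two functional FSTs described in the Figure 2 caption can be sketched in a few lines. The state names, the alphabets ({0, 1} for (a), {0, 1, 2} for (b)), and the exact transition tables below are assumptions reconstructed from the caption text, not the figure itself. FST (b) is written as a non-deterministic machine that guesses the last symbol up front and accepts only if the guess turns out right, which is one way to realize "the output can only be determined after the last input symbol".

```python
def transduce(transitions, start, accepting, inputs):
    """Return the outputs of all accepting paths of a (possibly
    non-deterministic) FST. A functional FST yields at most one output."""
    configs = {(start, "")}  # set of (current state, output so far)
    for sym in inputs:
        configs = {
            (nxt, out + emitted)
            for state, out in configs
            for nxt, emitted in transitions.get((state, sym), ())
        }
    return {out for state, out in configs if state in accepting}

# (a) Delete leading zeros (assumed alphabet {0, 1}).
STRIP_ZEROS = {
    ("lead", "0"): [("lead", "")],   # still in the leading run: drop the 0
    ("lead", "1"): [("body", "1")],  # first non-zero symbol: start copying
    ("body", "0"): [("body", "0")],
    ("body", "1"): [("body", "1")],
}

# (b) Rewrite every 0 as the last input symbol (1 or 2).
# Branch k assumes the input ends in k and is accepting only right after a k.
LOOKAHEAD = {}
for k in ("1", "2"):
    mid, end = f"mid{k}", f"end{k}"
    for state in ("start", mid, end):
        for sym in "012":
            out = k if sym == "0" else sym
            nxt = end if sym == k else mid
            LOOKAHEAD.setdefault((state, sym), []).append((nxt, out))

if __name__ == "__main__":
    print(transduce(STRIP_ZEROS, "lead", {"lead", "body"}, "00101"))
    print(transduce(LOOKAHEAD, "start", {"end1", "end2"}, "0201"))
    print(transduce(LOOKAHEAD, "start", {"end1", "end2"}, "0102"))
```

Under these assumptions, inputs to (b) that end in 0 have no accepting path, so `transduce` returns an empty set for them; the caption does not say how the original FST treats that case.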