Semformer: Transformer Language Models with Semantic Planning

Yongjing Yin, Junran Ding, Kai Song, Yue Zhang

TL;DR

Semformer is a novel method for training a Transformer language model that explicitly models the semantic planning of the response. It adds a sequence of planning tokens to the prefix and trains their representations to predict the latent semantic representations of the response, which are induced by an autoencoder.

Abstract

Next-token prediction serves as the dominant component in current neural language models. During the training phase, the model employs teacher forcing, predicting each token based on all preceding ground-truth tokens. However, this approach has been found to create shortcuts, utilizing the revealed prefix to spuriously fit future tokens, potentially compromising the accuracy of the next-token predictor. In this paper, we introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of the response. Specifically, we incorporate a sequence of planning tokens into the prefix, guiding the planning token representations to predict the latent semantic representations of the response, which are induced by an autoencoder. In a minimal planning task (i.e., graph path-finding), our model exhibits near-perfect performance and effectively mitigates shortcut learning, a feat that standard training methods and baseline models have been unable to accomplish. Furthermore, we pretrain Semformer from scratch with 125M parameters, demonstrating its efficacy through measures of perplexity, in-context learning, and fine-tuning on summarization tasks.
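
As a rough illustration of the objective described in the abstract, the training loss could be sketched as below. The abstract and the Figure 2 caption state only that the planning-token representations are regressed to autoencoder latents with an $L_2$ loss; everything else here is an assumption made for illustration: $k$ is the number of planning tokens, $h_i$ is the language model's representation of the $i$-th planning token, $z_i$ is the corresponding latent produced by the autoencoder's encoder from the response, $\mathcal{L}_{\mathrm{NTP}}$ is the usual next-token prediction loss, $\mathcal{L}_{\mathrm{AE}}$ is an autoencoder reconstruction loss, and $\alpha$ is a hypothetical weighting coefficient. How the terms are actually combined and weighted in the paper may differ.

$$
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{NTP}} \;+\; \mathcal{L}_{\mathrm{AE}} \;+\; \alpha \cdot \frac{1}{k} \sum_{i=1}^{k} \lVert h_i - z_i \rVert_2^2
$$

Under this reading, the regression term is what pushes the planning tokens to summarize the response before any response token is generated, while the next-token loss is left unchanged.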

Paper Structure (33 sections, 6 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: The Clever Hans cheat in a graph path-finding problem, a minimal lookahead task. The task is to find the correct path given the adjacency list, the start node, and the target node.
  • Figure 2: Illustration of our Semformer. We introduce trainable tokens in language modeling. The representations of the tokens encoded by the language model are regressed to the latent representations of the response with $L_2$ loss. We can share the parameters between the language model and the encoder, and utilize a small decoder to enhance training efficiency.
  • Figure 3: Convergence curves of Teacher-less, BoW, and our Semformer on tasks G(5,30) and G(10,20).
  • Figure 4: Convergence curves of models with different latent dimensions.
  • Figure 5: Convergence curves of models with different numbers of planning tokens.
  • ...and 2 more figures
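
To make the path-finding task from Figure 1 concrete, here is a minimal Python sketch of one way such an instance could be built. It assumes a star-shaped graph in which several arms branch out from a central start node; the function names and the parameters `num_arms` and `arm_length` are hypothetical and are not claimed to match the paper's G(·,·) notation, which is not defined in this excerpt. The second function illustrates why teacher forcing can invite a shortcut: once a ground-truth prefix is revealed, the next node can often be read directly off the edge list with no planning at all.

```python
import random


def make_star_graph_task(num_arms, arm_length, seed=0):
    """Build a hypothetical star-shaped path-finding instance.

    Returns (edges, start, target, path): a shuffled directed edge list,
    the central start node, a leaf chosen as target, and the true path.
    """
    rng = random.Random(seed)
    labels = list(range(1 + num_arms * arm_length))
    rng.shuffle(labels)  # random node names, so labels carry no information
    start = labels[0]
    arms, edges = [], []
    for a in range(num_arms):
        arm = labels[1 + a * arm_length: 1 + (a + 1) * arm_length]
        arms.append(arm)
        prev = start
        for node in arm:
            edges.append((prev, node))  # directed edge pointing away from the center
            prev = node
    rng.shuffle(edges)  # edge order must not leak the answer
    target_arm = rng.choice(arms)
    return edges, start, target_arm[-1], [start] + target_arm


def follow_edges_cheat(edges, path_prefix):
    """Shortcut predictor: look up the unique successor of the last prefix node.

    Returns None when the last node has zero or several successors,
    which is the case at the center, the only step that needs lookahead.
    """
    last = path_prefix[-1]
    successors = [v for u, v in edges if u == last]
    return successors[0] if len(successors) == 1 else None


edges, start, target, path = make_star_graph_task(num_arms=3, arm_length=4, seed=0)
```

In this sketch, every step after the first is solved by the lookup alone when the true prefix is given, so only the choice of the first node after the start requires looking ahead to the target.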