Learning to Plan Long-Term for Language Modeling

Florian Mai, Nathan Cornille, Marie-Francine Moens

TL;DR

By sampling multiple plans at once, this paper conditions the language model on an accurate approximation of the distribution of text continuations, which leads to better next-token prediction accuracy.

Abstract

Modern language models predict the next token in the sequence by considering the past text through a powerful function such as attention. However, language models have no explicit mechanism that allows them to spend computation time on planning long-distance future text, leading to suboptimal token prediction. In this paper, we propose a planner that predicts a latent plan for many sentences into the future. By sampling multiple plans at once, we condition the language model on an accurate approximation of the distribution of text continuations, which leads to better next-token prediction accuracy. In effect, this allows trading computation time for prediction accuracy.
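
The idea of sampling several plans and conditioning the next-token prediction on them can be illustrated with a toy sketch. This is not the authors' implementation: the discrete plan codes, the averaging of plan embeddings, the linear `projection`, and every function name below are hypothetical choices made only for illustration. The parameters `k` and `temperature` stand in for the number of samples $K$ and the sampling temperature $\tau$ mentioned in the figure captions.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_plans(plan_logits, k, temperature=1.0, rng=None):
    """Draw k plan indices from the (hypothetical) planner distribution."""
    rng = rng or random.Random()
    probs = softmax(plan_logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=k)


def plan_conditioning(plan_ids, plan_embeddings):
    """Average the sampled plans' embeddings into one conditioning vector.

    Averaging is an assumption of this sketch, not a detail from the paper.
    """
    dim = len(plan_embeddings[0])
    return [
        sum(plan_embeddings[i][d] for i in plan_ids) / len(plan_ids)
        for d in range(dim)
    ]


def next_token_probs(base_logits, cond, projection):
    """Toy conditioning: shift each token logit by a linear function of cond."""
    shifted = [
        b + sum(w * c for w, c in zip(row, cond))
        for b, row in zip(base_logits, projection)
    ]
    return softmax(shifted)


# Toy usage: 3 candidate plans, 2-dimensional plan embeddings, 4-token vocabulary.
rng = random.Random(0)
plan_logits = [2.0, 0.5, -1.0]
plan_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
base_logits = [0.0, 0.0, 0.0, 0.0]
projection = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

plan_ids = sample_plans(plan_logits, k=8, temperature=1.0, rng=rng)
cond = plan_conditioning(plan_ids, plan_embeddings)
probs = next_token_probs(base_logits, cond, projection)
```

In this sketch, a larger `k` makes the averaged vector a less noisy summary of the planner's distribution at the cost of more sampling work, which mirrors the trade of computation time for prediction accuracy described in the abstract.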

Paper Structure

This paper contains 24 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of our method.
  • Figure 2: Performance and relative generation time as a function of the number of samples $K$ drawn.
  • Figure 3: Perplexity on the validation set depending on the sampling temperature $\tau$. Since the textual context in the evaluation on the validation set is shorter, reported perplexities are larger than on the test set.