Learning to Plan for Language Modeling from Unlabeled Data

Nathan Cornille, Marie-Francine Moens, Florian Mai

TL;DR

This work addresses the limited planning ability of purely next-token-based large language models by introducing an external planner that learns to predict abstract writing actions from unlabeled data. The actions are derived by clustering sentence embeddings, and planner-predicted actions are fed into the language model through an adapter, enabling planning without task-specific supervision. Empirically, the approach improves perplexity and text-structure generation across GPT-2 and OLMo models, with external planners outperforming internal planning strategies. Because the planning framework is modular and self-supervised, planners can be developed at scale and shared across models, suggesting a path toward more coherent, structure-aware language generation.

Abstract

By training to predict the next token in an unlabeled corpus, large language models learn to perform many tasks without any labeled data. However, their next-token-prediction objective arguably limits their performance in scenarios that require planning, such as writing a coherent article. In this paper, we train a module for planning the future writing process via a self-supervised learning objective. Given the textual context, this planning module learns to predict future abstract writing actions, which correspond to centroids in a clustered text embedding space. By conditioning on these actions, our model extends the successful language model formula to more abstract planning in an unsupervised way. Empirically, we demonstrate that our method improves language modeling performance in general, particularly with respect to the text structure. Because our framework uses a planner module that is unsupervised and external to the language model, new planner modules can be trained at large scale and easily be shared with the community.
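
To make the idea of "abstract writing actions as centroids in a clustered text embedding space" concrete, here is a minimal sketch, not the authors' implementation. It assumes toy 2-D vectors in place of real sentence embeddings and uses a small pure-Python k-means with a deterministic farthest-point initialization; the names `kmeans`, `nearest_centroid`, and `planner_pairs` are hypothetical and do not come from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec; this index plays the role of an action id."""
    return min(range(len(centroids)), key=lambda i: squared_distance(vec, centroids[i]))


def kmeans(vectors, k, iterations=20):
    """Tiny k-means with farthest-point initialization (an assumption, chosen for determinism)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Pick the vector whose nearest existing centroid is farthest away.
        farthest = max(vectors, key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(farthest))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest_centroid(v, centroids)].append(v)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


# Toy stand-ins for sentence embeddings of one document, in reading order (hypothetical values).
embeddings = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 5.0], [0.05, 0.05], [4.9, 5.0]]

centroids = kmeans(embeddings, k=2)
actions = [nearest_centroid(e, centroids) for e in embeddings]

# Self-supervised planner targets: given sentences up to position i, predict the action of sentence i + 1.
planner_pairs = [(i, actions[i + 1]) for i in range(len(actions) - 1)]

print(actions)
print(planner_pairs)
```

In this sketch the labels come entirely from the clustering, so no human annotation is needed; a planner would be trained on pairs like `planner_pairs`, with the real context text in place of the position index.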

Paper Structure (41 sections, 5 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: The three core phases of our proposed method to learn a planner from unlabeled data. Blue indicates frozen parameters, yellow indicates trainable parameters, and grey indicates no learnable parameters.
  • Figure 2: Performance by number of clusters.
  • Figure 3: The blue dots show the average perplexity when conditioning on the $k$'th best action (ranked by the perplexity it leads to), with the rank $k$ on the horizontal axis. The red horizontal line shows the average perplexity when selecting the oracle code, and the blue vertical line marks the rank whose average perplexity is nearest to the oracle perplexity.
  • Figure 4: The green curve shows the average perplexity when conditioning on the $k$'th best noisy variation of the oracle action embedding (ranked by the perplexity it leads to), with the rank $k$ on the horizontal axis. The green dotted horizontal line indicates the perplexity of the best noise variant, and the green dotted vertical line marks the rank, among noise variations, whose average perplexity is nearest to the oracle perplexity. The rest is the same as in Figure 3.
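
The rank-based curves described in the Figure 3 and Figure 4 captions can be illustrated with a short sketch. It assumes a table of perplexities with one row per example and one column per candidate action; all numbers below are made up for illustration and are not results from the paper, and the function names are hypothetical.

```python
def mean_perplexity_by_rank(ppl_per_example):
    """Average perplexity at each rank k, where rank 1 is each example's best (lowest) action."""
    ranked = [sorted(row) for row in ppl_per_example]
    n_examples = len(ranked)
    n_actions = len(ranked[0])
    return [sum(row[k] for row in ranked) / n_examples for k in range(n_actions)]


def nearest_rank(curve, target):
    """1-based rank whose average perplexity is closest to the target value."""
    best_index = min(range(len(curve)), key=lambda k: abs(curve[k] - target))
    return best_index + 1


# Hypothetical perplexities: rows are examples, columns are candidate actions.
ppl = [
    [30.0, 22.0, 26.0, 40.0],
    [18.0, 35.0, 20.0, 25.0],
    [28.0, 24.0, 50.0, 31.0],
]
# Hypothetical perplexity of each example when conditioning on its oracle action.
oracle_ppl = [26.0, 20.0, 28.0]

curve = mean_perplexity_by_rank(ppl)               # the dots in a Figure 3 style plot
oracle_mean = sum(oracle_ppl) / len(oracle_ppl)    # the horizontal line
rank = nearest_rank(curve, oracle_mean)            # the vertical line

print(curve, oracle_mean, rank)
```

In this toy data the oracle action is not always the lowest-perplexity action for an example, which is why the nearest rank can be greater than 1.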