Predicting vs. Acting: A Trade-off Between World Modeling & Agent Modeling

Margaret Li, Weijia Shi, Artidoro Pagnoni, Peter West, Ari Holtzman

TL;DR

The paper investigates a trade-off between world modeling (next-token prediction over arbitrary text) and agent modeling (interactive, goal-directed generation) in RLHF-tuned LLMs. Using cross-model perplexity analyses, hidden-state probing, n-gram and MAFFT-based alignment, and visualization of blueprints (anchor spans), it argues that RLHF shifts models toward acting at the cost of broad world-modeling ability, concentrating probability onto self-predictable spans. Three findings stand out: RLHF models underperform the Base LMs they were tuned from on next-token prediction; their output distributions collapse, with anchor spans acting as implicit blueprints; and their hidden representations are predictive of future tokens, consistent with planning ahead. Together these suggest a general, perhaps inevitable, trade-off between predicting wide distributions and acting coherently within a narrower subspace. The work argues for integrating world and agent modeling so that models retain broad predictive power while supporting robust, long-horizon planning, with implications for the design of future aligned AI systems.

Abstract

RLHF-aligned LMs have shown unprecedented ability on both benchmarks and long-form text generation, yet they struggle with one foundational task: next-token prediction. As RLHF models become agent models aimed at interacting with humans, they seem to lose their world modeling -- the ability to predict what comes next in arbitrary documents, which is the foundational training objective of the Base LMs that RLHF adapts. Besides empirically demonstrating this trade-off, we propose a potential explanation: to perform coherent long-form generation, RLHF models restrict randomness via implicit blueprints. In particular, RLHF models concentrate probability on sets of anchor spans that co-occur across multiple generations for the same prompt, serving as textual scaffolding but also limiting a model's ability to generate documents that do not include these spans. We study this trade-off on the most effective current agent models, those aligned with RLHF, while exploring why this may remain a fundamental trade-off between models that act and those that predict, even as alignment techniques improve.
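
The idea of anchor spans, text that recurs across many generations for the same prompt, can be illustrated with a rough sketch. The code below is not the paper's method: it substitutes a simple whitespace-token n-gram count for the alignment procedure, and the function name `shared_ngrams`, the parameters `n` and `min_fraction`, and the toy generations are all assumptions made for illustration.

```python
from collections import Counter


def shared_ngrams(generations, n=3, min_fraction=0.5):
    """Return n-grams that occur in at least `min_fraction` of the generations.

    Hypothetical helper: a crude stand-in for anchor-span detection that
    ignores ordering and position, unlike a true sequence alignment.
    """
    doc_counts = Counter()
    for text in generations:
        tokens = text.split()
        # Use a set so each n-gram is counted at most once per generation.
        seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_counts.update(seen)
    cutoff = min_fraction * len(generations)
    return {gram for gram, count in doc_counts.items() if count >= cutoff}


# Toy generations (invented for illustration, not model output).
gens = [
    "Python is dynamically typed and easy to read",
    "Python is dynamically typed but slower",
    "JavaScript runs in the browser",
]
anchors = shared_ngrams(gens, n=3, min_fraction=0.5)
```

Under this simplification, a model whose generations share many long n-grams would be a candidate for the "implicit blueprint" behavior described above; an alignment-based analysis additionally checks that the spans appear in the same order.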

Paper Structure (29 sections, 2 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: RLHF model generations on the same prompt are highly similar to each other, unlike Base LMs. For each of 80 short prompts, we collect and align 100 generations (nucleus sampling, p = 0.9) from Base (pretrained) and RLHF models. Above: (§\ref{sec:anchor_spans}) A Sankey diagram of 100 RLHF model generations for the prompt "What are the main differences between Python and JavaScript programming languages?" Sequences share multiple lengthy anchor spans which appear verbatim in the same order, forming a uniform skeleton for nearly all generations. Below: (§\ref{sec:backbones}) Over the sequence length, the number of generations aligned with at least 5 others, averaged over all prompts. Base model generations maintain low levels of alignment. RLHF model generations exhibit high alignment throughout, but especially near the beginning and end of generations.
  • Figure 2: (§\ref{sec:ppl}) RLHF models are significantly worse on language modeling tasks compared to the Base LMs they are adapted from, even on data similar to their preference tuning corpora. (A): Perplexity increase in models post-RLHF compared to the Base LM, across several model families and sizes, evaluated on 9 text corpora grouped into 4 categories. RLHF models consistently underperform the Base models they were tuned from. (The lower the perplexity, the better.) (B): We finetune each model from (A) on the target corpus; the general trend remains unchanged. RLHF models are consistently inferior even post-finetuning. Details in Appendix \ref{sec:setup-details}.
  • Figure 3: (§\ref{sec:collapse}) RLHF models assign nearly all of the next-token probability mass to a single token, more than Base models. For Base and RLHF models, we calculate the next-token probability distributions on the gold sequences, as well as on the models' own generations (nucleus sampling, p = 0.9). We show the cumulative probability mass of the tokens sorted in descending order of probability. RLHF models assign a larger portion of the probability mass to a small number of tokens, compared to Base models. Details are in Appendix \ref{sec:setup-topk}.
  • Figure 4: (§\ref{sec:collapse}) RLHF models assign near-zero probability mass to almost all tokens when generating, more than Base models. For both 7B and 70B models, RLHF models assign non-negligible ($>10^{-8}$) probability to significantly fewer vocabulary tokens when predicting next tokens from their own generations (nucleus sampling, p = 0.9), but slightly more tokens when predicting next tokens on gold text sequences. This suggests RLHF models only exhibit collapse in their own generative distributions. Details are in Appendix \ref{sec:setup-topk}.
  • Figure 5: (§\ref{sec:collapse}) RLHF models have lower perplexity when evaluated on their own generations. We generate (nucleus sampling, p = 0.9) completions of prefixes taken from 11 datasets, grouped into 4 categories. Details are in Appendix \ref{tbl: self_ppl_data}.
  • ...and 11 more figures
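
The quantities named in the captions for Figures 2 to 4 (perplexity, cumulative probability mass of the top-ranked tokens, and the number of tokens above a $10^{-8}$ probability threshold) can be sketched as follows. This is a minimal illustration on hand-written toy distributions, not the paper's evaluation code; the function names and the example vectors `peaked` and `flat` are assumptions, and a real analysis would take the probabilities from a model's softmax output.

```python
import math


def cumulative_mass(probs, k):
    """Total probability held by the k most likely tokens (Figure 3 style)."""
    return sum(sorted(probs, reverse=True)[:k])


def non_negligible_count(probs, threshold=1e-8):
    """Number of tokens whose probability exceeds the threshold (Figure 4 style)."""
    return sum(1 for p in probs if p > threshold)


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over a sequence (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy next-token distributions over a 10-token vocabulary (invented values).
peaked = [0.97, 0.02, 0.01] + [0.0] * 7   # collapsed: mass on very few tokens
flat = [0.1] * 10                          # spread evenly over the vocabulary

top1_peaked = cumulative_mass(peaked, 1)
top1_flat = cumulative_mass(flat, 1)
support_peaked = non_negligible_count(peaked)
support_flat = non_negligible_count(flat)

# A sequence in which every gold token received probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)
```

In this toy setting the peaked distribution puts most of its mass on its top token and supports only three tokens, while the flat one supports all ten, which mirrors the qualitative contrast the captions draw between RLHF and Base models.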