Predicting vs. Acting: A Trade-off Between World Modeling & Agent Modeling
Margaret Li, Weijia Shi, Artidoro Pagnoni, Peter West, Ari Holtzman
TL;DR
The paper investigates a trade-off between world modeling (next-token prediction) and agent modeling (interactive, goal-directed generation) in RLHF-tuned LLMs. Using cross-model perplexity analyses, hidden-state probing, n-gram and MAFFT-based alignment, and blueprint visualization of anchor spans, it shows that RLHF shifts models toward acting at the cost of broad world-modeling ability, concentrating probability on self-predictable spans. Three findings stand out: RLHF models underperform Base LMs on next-token prediction; their output distributions collapse, with anchor spans acting as implicit blueprints; and they rely on forward-predictive hidden representations to plan ahead. Together these suggest a general, perhaps inevitable, trade-off between predicting wide distributions and acting coherently within a narrower subspace. The work argues for integrating world and agent modeling so that models retain broad predictive power while supporting robust, long-horizon planning, with implications for the design of future aligned AI systems.
Abstract
RLHF-aligned LMs have shown unprecedented ability on both benchmarks and long-form text generation, yet they struggle with one foundational task: next-token prediction. As RLHF models become agent models aimed at interacting with humans, they seem to lose their world modeling -- the ability to predict what comes next in arbitrary documents, which is the foundational training objective of the Base LMs that RLHF adapts. Besides empirically demonstrating this trade-off, we propose a potential explanation: to perform coherent long-form generation, RLHF models restrict randomness via implicit blueprints. In particular, RLHF models concentrate probability on sets of anchor spans that co-occur across multiple generations for the same prompt, serving as textual scaffolding but also limiting a model's ability to generate documents that do not include these spans. We study this trade-off on the most effective current agent models, those aligned with RLHF, while exploring why this may remain a fundamental trade-off between models that act and those that predict, even as alignment techniques improve.
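To make the idea of anchor spans concrete, the following is a minimal sketch of how spans that co-occur across generations for one prompt might be detected. It is not the authors' procedure: it uses plain n-gram document frequency, and the function names (`ngrams`, `shared_spans`), the whitespace tokenization, the n-gram length, the `min_fraction` threshold, and the sample strings are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_spans(generations, n=3, min_fraction=0.8):
    """Find n-grams that appear in at least min_fraction of the generations.

    generations: list of strings sampled for the same prompt.
    Whitespace tokenization and the threshold are illustrative assumptions.
    """
    counts = Counter()
    for text in generations:
        # Each n-gram is counted once per generation, so the counts are
        # document frequencies rather than raw occurrence counts.
        counts.update(ngrams(text.lower().split(), n))
    threshold = min_fraction * len(generations)
    return {gram: c for gram, c in counts.items() if c >= threshold}


# Hypothetical samples standing in for several generations from one prompt.
samples = [
    "Here are three tips to sleep better tonight",
    "Sure! Here are three tips for better sleep",
    "Here are three tips that may help you sleep",
]
anchors = shared_spans(samples, n=3, min_fraction=1.0)
```

Under this sketch, a model whose generations share many high-frequency n-grams would show more anchor-like structure than one whose samples rarely overlap. Comparing the number of shared spans between a Base LM and its RLHF-tuned counterpart on the same prompts would be one rough way to probe the concentration of probability the abstract describes.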
