BWArea Model: Learning World Model, Inverse Dynamics, and Policy for Controllable Language Generation

Chengxing Jia, Pengyuan Wang, Ziniu Li, Yi-Chen Li, Zhilong Zhang, Nan Tang, Yang Yu

TL;DR

BWArea reframes language generation as a decision-making process by decoupling it into a language world model, an inverse dynamics model that infers latent actions, and a cognitive policy that selects actions. Compared with fully auto-regressive LLMs, this latent-action decomposition reduces predictive variance, improves controllability via downstream rewards, and is more robust to dirty data. Trained on 30B clean tokens (world model 1B, inverse dynamics model 0.5B, policy 1.1B) and evaluated on benchmarks such as MMLU, DROP, BBH, and TruthfulQA, BWArea achieves competitive results and superior task controllability on TextWorld and BigBench Hard, with additional advantages in data efficiency and resilience to data noise. RL fine-tuning and data-scaling experiments support its practical value for controllable language generation and suggest a scalable path toward more aligned and adaptable NLP systems.

Abstract

Large language models (LLMs) have catalyzed a paradigm shift in natural language processing, yet their limited controllability poses a significant challenge for downstream applications. We aim to address this by drawing inspiration from the neural mechanisms of the human brain, specifically Broca's and Wernicke's areas, which are crucial for language generation and comprehension, respectively. In particular, Broca's area receives cognitive decision signals from Wernicke's area, so that language generation is treated as an intricate decision-making process, in contrast to the fully auto-regressive language generation of existing LLMs. In a similar vein, our proposed system, the BWArea model, conceptualizes language generation as a decision-making task. This model has three components: a language world model, an inverse dynamics model, and a cognitive policy. Like Wernicke's area, the inverse dynamics model is designed to deduce the underlying cognitive intentions, or latent actions, behind each token. Like existing LLMs, the BWArea model is amenable to both pre-training and fine-tuning. Using 30B clean pre-training tokens, we trained a BWArea model that performs competitively with LLMs of equal size (1B parameters). Unlike fully auto-regressive LLMs, its pre-training performance does not degrade when dirty data unintentionally appears. This shows the advantage of the BWArea model's decomposed structure in reducing the effort spent on laborious data selection and labeling. Finally, we show that the BWArea model offers enhanced controllability via fine-tuning the cognitive policy with downstream reward metrics, thereby simplifying alignment. On 9 out of 10 tasks from two suites, TextWorld and BigBench Hard, our method outperforms auto-regressive LLMs.
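
One way to write down the three-component decomposition, using the notation of Figure 2, is sketched below. The symbols $f_{\text{inv}}$, $\pi$, $P_{\text{world}}$, and the finite latent action set $\mathcal{A}$ are illustrative names chosen here, not taken from the text, and the marginal in the last line is one plausible reading rather than the authors' stated objective.

$$
\begin{aligned}
a_t &= f_{\text{inv}}(x_1, \ldots, x_t, x_{t+1}) && \text{(inverse dynamics: uses the future token)} \\
a_t &\sim \pi(\cdot \mid x_1, \ldots, x_t) && \text{(cognitive policy: context only)} \\
x_{t+1} &\sim P_{\text{world}}(\cdot \mid x_1, \ldots, x_t,\ a_1, \ldots, a_t) && \text{(language world model)} \\
p(x_{t+1} \mid x_1, \ldots, x_t,\ a_1, \ldots, a_{t-1}) &= \sum_{a \in \mathcal{A}} \pi(a \mid x_1, \ldots, x_t)\, P_{\text{world}}(x_{t+1} \mid x_1, \ldots, x_t,\ a_1, \ldots, a_{t-1}, a)
\end{aligned}
$$

Under this reading, fine-tuning against a downstream reward would only need to adjust $\pi$, which is consistent with the abstract's description of controlling generation through the cognitive policy.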

Paper Structure (35 sections, 8 equations, 9 figures, 7 tables, 4 algorithms)

Figures (9)

  • Figure 1: An example of how our BWArea model mimics the human brain for language processing.
  • Figure 2: Framework of our architecture. (a) Inverse Dynamics Model: takes the context $(x_1, \ldots, x_t)$ together with the future token $x_{t+1}$ as input and outputs the latent action $a_t$. (b) Policy Model: takes the context $(x_1, \ldots, x_t)$, without the future token, as input and outputs a categorical distribution over the current action. (c) Language World Model: takes the context $(x_1, \ldots, x_t)$ and the latent actions $(a_1, \ldots, a_t)$ as input and predicts the next token.
  • Figure 3: The average accuracy on BBH (7 tasks).
  • Figure 4: An illustration of the Tw-Treasure Hunter game.
  • Figure 5: Training Reward Curves of Reinforcement Learning on TextWorld.
  • ...and 4 more figures
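
The input/output contract described in the Figure 2 caption can be sketched as a toy NumPy program. This is a minimal illustration under loud assumptions: the sizes `VOCAB_SIZE`, `NUM_ACTIONS`, and `DIM`, the mean-pooling `encode_context` helper, the random linear layers, and the `generate` loop are all hypothetical placeholders introduced here, not the architecture or training procedure of the paper. Only the data flow (what each component receives and returns) follows the caption.

```python
import numpy as np

# Hypothetical toy sizes; none of these values come from the paper.
VOCAB_SIZE = 16
NUM_ACTIONS = 4
DIM = 8

_init = np.random.default_rng(0)
token_emb = _init.normal(size=(VOCAB_SIZE, DIM))
action_emb = _init.normal(size=(NUM_ACTIONS, DIM))
w_inverse = _init.normal(size=(2 * DIM, NUM_ACTIONS))
w_policy = _init.normal(size=(DIM, NUM_ACTIONS))
w_world = _init.normal(size=(2 * DIM, VOCAB_SIZE))


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def encode_context(tokens):
    """Placeholder encoder (assumption): mean of token embeddings."""
    return token_emb[np.asarray(tokens)].mean(axis=0)


def inverse_dynamics(context, next_token):
    """(a) Context x_1..x_t plus the future token x_{t+1} -> latent action a_t."""
    features = np.concatenate([encode_context(context), token_emb[next_token]])
    return int(np.argmax(features @ w_inverse))


def policy(context):
    """(b) Context only -> categorical distribution over latent actions."""
    return softmax(encode_context(context) @ w_policy)


def world_model(context, actions):
    """(c) Context x_1..x_t and actions a_1..a_t -> next-token distribution."""
    action_summary = action_emb[np.asarray(actions)].mean(axis=0)
    features = np.concatenate([encode_context(context), action_summary])
    return softmax(features @ w_world)


def generate(prompt, num_new_tokens, seed=0):
    """Sample tokens by alternating policy and world model (non-empty prompt)."""
    sampler = np.random.default_rng(seed)
    tokens = list(prompt)
    # Label the prompt with latent actions a_1..a_{t-1} via inverse dynamics.
    actions = [inverse_dynamics(tokens[:t], tokens[t]) for t in range(1, len(tokens))]
    for _ in range(num_new_tokens):
        # The policy sees only the context and picks the current action a_t.
        actions.append(int(sampler.choice(NUM_ACTIONS, p=policy(tokens))))
        # The world model sees the context and all actions so far.
        tokens.append(int(sampler.choice(VOCAB_SIZE, p=world_model(tokens, actions))))
    return tokens, actions


tokens, actions = generate([1, 2, 3], num_new_tokens=5)
print(len(tokens), len(actions))
```

In this sketch the inverse dynamics model is the only component that sees the future token, so it can label existing text with actions, while at generation time the policy must choose an action from the context alone. Keeping the action choice in a separate `policy` function is what would allow a reward signal to adjust it without touching `world_model`.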