Enhancing Language Model Rationality with Bi-Directional Deliberation Reasoning

Yadong Zhang; Shaoguang Mao; Wenshan Wu; Yan Xia; Tao Ge; Man Lan; Furu Wei

TL;DR

This work addresses the challenge of rational decision-making in LLMs within uncertain, multi-agent environments. It introduces BIDDER, a bi-directional deliberation framework that infers hidden states from history, explores future trajectories via opponent modeling, and aggregates long-term rewards to maximize expected utility, drawing on decision theory and Q-Learning principles. Through experiments in Limit Texas Hold'em and negotiation, BIDDER demonstrates superior decision rationality and payoff performance compared with unidirectional reasoning baselines, highlighting the practical value of bidirectional planning for LLM agents. The findings suggest that integrating historical context with forward-looking exploration enables more informed, proactive, and strategically sound AI behavior in complex tasks.
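The step "aggregates long-term rewards to maximize expected utility" can be made concrete with a small sketch. It scores each candidate action by averaging discounted returns over sampled future trajectories. This is a Monte Carlo estimate of the action value Q(s, a), the quantity Q-learning learns. The sketch then picks the action with the highest estimate. It is not BIDDER's implementation: in BIDDER the language model itself explores and evaluates trajectories. The function names, the discount factor `gamma`, and the reward numbers below are hypothetical and chosen only for illustration.

```python
from statistics import mean


def discounted_return(rewards, gamma=0.9):
    """Sum of rewards along one trajectory, discounting step t by gamma**t."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


def action_values(trajectories, gamma=0.9):
    """Monte Carlo estimate of Q(s, a): mean discounted return per action."""
    return {
        action: mean(discounted_return(traj, gamma) for traj in trajs)
        for action, trajs in trajectories.items()
    }


def choose_action(trajectories, gamma=0.9):
    """Return the action with the highest estimated expected utility."""
    values = action_values(trajectories, gamma)
    return max(values, key=values.get), values


# Hypothetical explored futures: per-step chip changes for each candidate move.
explored = {
    "fold": [[0.0], [0.0]],
    "call": [[-1.0, 3.0], [-1.0, -2.0]],
    "raise": [[-2.0, 6.0], [-2.0, 4.0], [-2.0, -1.0]],
}

best, values = choose_action(explored)
print(best)  # → raise
```

In this toy setup, raising pays an immediate cost but has the best average discounted payoff over its explored futures. Its estimate is about 0.7, against roughly -0.55 for calling and 0 for folding. A purely left-to-right choice that looks only at the immediate reward would never rank it first.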

Abstract

This paper introduces BI-Directional DEliberation Reasoning (BIDDER), a novel reasoning approach to enhance the decision rationality of language models. Traditional reasoning methods typically rely on historical information and employ a uni-directional (left-to-right) reasoning strategy. This lack of bi-directional deliberation results in limited awareness of potential future outcomes and insufficient integration of historical context, leading to suboptimal decisions. BIDDER addresses this gap by incorporating principles of rational decision-making, specifically managing uncertainty and predicting expected utility. Our approach involves three key processes: (1) inferring, from historical data, hidden states that represent uncertain information in the decision-making process; (2) using these hidden states to predict potential future states and outcomes; and (3) integrating historical information (past contexts) and long-term outcomes (future contexts) to inform reasoning. By leveraging bi-directional reasoning, BIDDER ensures thorough exploration of both past and future contexts, leading to more informed and rational decisions. We tested BIDDER's effectiveness in two well-defined scenarios: Poker (Limit Texas Hold'em) and Negotiation. Our experiments demonstrate that BIDDER significantly improves the decision-making capabilities of LLMs and LLM agents.
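The first process, inferring a hidden state from history, can be illustrated with an explicit Bayesian belief update. This is an assumption-laden stand-in, since BIDDER asks the language model to infer hidden states. The sketch instead assumes a hypothetical set of opponent styles (`tight`, `loose`) with made-up action likelihoods. It then updates a posterior over those styles from the opponent's observed actions. `LIKELIHOODS` and `infer_hidden_state` are illustrative names, not part of the paper.

```python
# Hypothetical opponent styles and the probability each assigns to an action.
LIKELIHOODS = {
    "tight": {"fold": 0.5, "call": 0.3, "raise": 0.2},
    "loose": {"fold": 0.1, "call": 0.4, "raise": 0.5},
}


def infer_hidden_state(history, prior=None):
    """Posterior over opponent styles given observed actions (Bayes' rule)."""
    styles = list(LIKELIHOODS)
    # Start from the given prior, or a uniform belief if none is supplied.
    belief = dict(prior) if prior else {s: 1.0 / len(styles) for s in styles}
    for action in history:
        # Multiply in the likelihood of this action under each style...
        for s in styles:
            belief[s] *= LIKELIHOODS[s][action]
        # ...then renormalize so the belief sums to 1.
        total = sum(belief.values())
        belief = {s: p / total for s, p in belief.items()}
    return belief


posterior = infer_hidden_state(["raise", "raise", "call"])
print(max(posterior, key=posterior.get))  # → loose
```

After two raises and a call, the belief shifts to about 0.89 on `loose`. A belief like this could then decide which opponent model to use when exploring future trajectories. In BIDDER, that role is played by the hidden states the model infers from the history.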

Paper Structure

This paper contains 31 sections, 6 equations, 4 figures, 3 tables, and 2 algorithms.

Introduction
Method
Infer Hidden State From History
Explore Future Trajectories with Opponent Modeling
Aggregate Rewards for Explored Trajectories
Experiments
Limit Texas Hold'em
Baselines
Main Result
Rationality Evaluation
Negotiation
Baselines
Metrics
Result
Related Work
...and 16 more sections

Figures (4)

Figure 1: Comparing Unidirectional and Bi-Directional Reasoning. Unidirectional reasoning uses historical context to make left-to-right decisions. In contrast, bi-directional reasoning explores potential future states and aggregates the expected utility of future moves. It then uses both the historical context and the future-exploration context to conduct bi-directional reasoning.
Figure 2: BI-Directional DEliberation Reasoning includes: 1. Hidden State Inference: inferring the hidden states behind decision-making (e.g., the opponent's strategies) from historical data (e.g., the opponent's actions); 2. Future Exploration: exploring possible future trajectories based on the inferred states; 3. Action Reward Collection: aggregating utilities from the predicted trajectories to make decisions. Bi-directional reasoning enhances decision rationality by incorporating both historical context and potential future outcomes.
Figure 3: Direct, BIDDER, and DeepCFR Action Distribution in Limit Texas Hold'em.
Figure 4: Distribution of hand strengths sampled from poker games. Hand strength ranges from 0 to 8, corresponding respectively to High Card, One Pair, Two Pair, Three of a Kind, Straight, Flush, Full House, Four of a Kind, and Straight Flush.
