Ask more, know better: Reinforce-Learned Prompt Questions for Decision Making with Large Language Models

Xue Yan; Yan Song; Xinyu Cui; Filippos Christianos; Haifeng Zhang; David Henry Mguni; Jun Wang

TL;DR

The paper tackles the brittleness and labor-intensive nature of hand-crafted prompts for LLM-driven decision making by proposing Bilevel-LLM, a bilevel framework where a prompt policy learns to generate task-relevant prompts that trigger CoT reasoning, and an action policy learns to act on the CoT outputs. The CoT process is provided by a fixed, capable LLM, while the prompt policy (outer level) and the action policy (inner level) are trained in a leader-follower loop to minimize the action-policy entropy and maximize environmental rewards, respectively, effectively integrating natural language reasoning with RL. The approach is evaluated across five environments, including Tower of Hanoi, Frozen Lake, ChainWorld, FourRoom, and Overcooked variants, and demonstrates superior performance and stability compared to baselines such as GFlan, Vanilla PPO, and GPT-3.5-based prompts, even when prompts are auto-generated. The work highlights a practical pathway toward generalist AI by reducing human prompt engineering and enabling adaptive, CoT-informed actions, with the potential for extension to multi-agent settings. Overall, Bilevel-LLM advances the integration of reasoning and action under a principled bilevel optimization, delivering practical benefits for complex decision-making tasks.
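The outer-level objective mentioned above (the prompt policy is rewarded for lowering the action policy's entropy) can be illustrated with a small toy. This is a minimal sketch under stated assumptions, not the paper's implementation: the two prompt candidates, the action distributions they induce, the learning rate, and all function names are hypothetical, the action policy is held fixed, and the exact expected policy gradient is used in place of sampled REINFORCE so that the run is deterministic.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical toy data: the action distribution that a fixed action policy
# produces after CoT reasoning triggered by each of two prompt candidates.
ACTION_PROBS_GIVEN_PROMPT = [
    [0.25, 0.25, 0.25, 0.25],  # prompt 0 leaves the action policy undecided
    [0.97, 0.01, 0.01, 0.01],  # prompt 1 leads to a decisive action choice
]


def train_prompt_logits(steps=200, learning_rate=0.5):
    """Gradient ascent on the expected negative action-policy entropy."""
    logits = [0.0, 0.0]
    # Outer-level reward for each prompt: negative entropy of the action policy.
    rewards = [-entropy(p) for p in ACTION_PROBS_GIVEN_PROMPT]
    for _ in range(steps):
        pi = softmax(logits)
        baseline = sum(p * r for p, r in zip(pi, rewards))
        # Exact gradient of E[reward] w.r.t. logit k: pi_k * (r_k - E[reward]).
        grads = [p * (r - baseline) for p, r in zip(pi, rewards)]
        logits = [l + learning_rate * g for l, g in zip(logits, grads)]
    return logits


if __name__ == "__main__":
    final = softmax(train_prompt_logits())
    print("prompt selection probabilities:", final)
```

In this toy, probability mass shifts toward the prompt whose induced action distribution is more decisive. In the full framework the action policy would itself be updated at the inner level to maximize environmental reward, which this sketch deliberately omits.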

Abstract

Large language models (LLMs) demonstrate their promise in tackling complicated practical challenges by combining action-based policies with chain of thought (CoT) reasoning. Having high-quality prompts on hand, however, is vital to the framework's effectiveness. Currently, these prompts are handcrafted utilising extensive human labor, resulting in CoT policies that frequently fail to generalise. Human intervention is also required to develop grounding functions that ensure low-level controllers appropriately process CoT reasoning. In this paper, we propose a comprehensive training framework for complex task-solving, incorporating human prior knowledge into the learning of action policies. To that purpose, we offer a new leader-follower bilevel framework that is capable of learning to ask relevant questions (prompts) and subsequently undertaking reasoning to guide the learning of actions. The prompt policy is employed to make introspective revisions based on historical findings, leading the CoT process to consider the anticipated goals and generate outputs that lead to decisive, high-performing actions. The action policy subsequently learns to comprehend and integrate the CoT outputs to take actions. Our empirical data reveal that our framework outperforms leading methods in $5$ decision-making tasks such as Overcooked and FourRoom.

Paper Structure (29 sections, 5 equations, 10 figures, 5 tables, 1 algorithm)

Introduction
Problem Formulation
Methodology
Prompt policy training via policy gradient
Experiments
Environments
Baselines
Ablation Studies
Related Work
Conclusion
Additional description about the methodology
Hyperparameter
Additional Description of Experiments
Environments
Prompts and CoT examples
...and 14 more sections

Figures (10)

Figure 1: Top: Example of the workflow from prompt candidates to CoT reasoning on Overcooked. The prompt policy first selects a prompt question from the candidate set. Subsequently, the CoT process generates complex reasoning guided by the prompt and the current state situation to assist in the subsequent action performing. Bottom: The illustration of our bilevel optimisation framework.
Figure 2: Results of comparison with baselines. We plot the mean and standard error of normalized reward averaged over $5$ seeds for trainable baselines, and over $20$ episodes for GPT-3.5 baselines. The cumulative rewards are normalized within the range $[0, 1]$, and the Area Under the Curve (AUC) is calculated by averaging over the entire training process.
Figure 3: Ablation studies. (a) The effect of different prompt generation strategies. (b) Verification of the effectiveness of Bilevel-LLM under multimodal state representations on ChainWorld. (c) Automatically generated prompts on Overcooked (Salad). Left: Normalized AUC reward. Right: Rewards during training.
Figure 4: Ablation of the entropy objective on ChainWorld (Partial). Left: Normalized AUC reward. Right: Entropy of the action policy.
Figure 5: An example of the step-by-step inference process of $\text{Bilevel-LLM}$ on the Overcooked task.
...and 5 more figures
