Ask more, know better: Reinforce-Learned Prompt Questions for Decision Making with Large Language Models
Xue Yan, Yan Song, Xinyu Cui, Filippos Christianos, Haifeng Zhang, David Henry Mguni, Jun Wang
TL;DR
The paper addresses the brittleness and labor cost of hand-crafted prompts for LLM-driven decision making by proposing Bilevel-LLM, a bilevel framework in which a prompt policy learns to generate task-relevant prompts that trigger CoT reasoning, and an action policy learns to act on the CoT outputs. The CoT reasoning is provided by a fixed, capable LLM. The prompt policy (outer level) and the action policy (inner level) are trained in a leader-follower loop: the prompt policy is trained to minimize the entropy of the action policy, and the action policy is trained to maximize environmental rewards, which integrates natural language reasoning with RL. The approach is evaluated on five environments (Tower of Hanoi, Frozen Lake, ChainWorld, FourRoom, and Overcooked variants) and shows better performance and stability than baselines such as GFlan, Vanilla PPO, and GPT-3.5-based prompts, even when prompts are auto-generated. The work points to a practical pathway toward generalist AI by reducing human prompt engineering and enabling adaptive, CoT-informed actions, with potential for extension to multi-agent settings.
Abstract
Large language models (LLMs) show promise in tackling complicated practical challenges by combining action-based policies with chain of thought (CoT) reasoning. Having high-quality prompts on hand, however, is vital to the framework's effectiveness. Currently, these prompts are handcrafted with extensive human labor, resulting in CoT policies that frequently fail to generalise. Human intervention is also required to develop grounding functions that ensure low-level controllers appropriately process CoT reasoning. In this paper, we propose a comprehensive training framework for complex task-solving that incorporates human prior knowledge into the learning of action policies. To that end, we introduce a new leader-follower bilevel framework that learns to ask relevant questions (prompts) and then performs reasoning to guide the learning of actions. The prompt policy makes introspective revisions based on historical findings, leading the CoT process to consider the anticipated goals and generate outputs that lead to decisive, high-performing actions. The action policy then learns to comprehend and integrate the CoT outputs to take actions. Our empirical results show that our framework outperforms leading methods on five decision-making tasks, including Overcooked and FourRoom.
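
The leader-follower signal described above can be illustrated with a small sketch. It assumes the action policy outputs a discrete probability distribution over actions and that the prompt policy is rewarded with the negative entropy of that distribution, so prompts that lead to decisive actions score higher. The function names `action_entropy` and `prompt_reward` are hypothetical and are not taken from the paper; the actual training objective may include additional terms.

```python
import math
from typing import Sequence


def action_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete action distribution."""
    # Zero-probability actions contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prompt_reward(probs: Sequence[float]) -> float:
    """Hypothetical leader signal: negative entropy of the follower's
    action distribution (an assumption made for illustration)."""
    return -action_entropy(probs)


# Two illustrative action distributions over four actions.
uncertain = [0.25, 0.25, 0.25, 0.25]   # follower is undecided
decisive = [0.97, 0.01, 0.01, 0.01]    # follower strongly prefers one action

# A prompt that yields the decisive distribution earns the higher reward.
better_prompt_is_decisive = prompt_reward(decisive) > prompt_reward(uncertain)
```

In this sketch the action policy would still be trained separately on environment rewards; only the prompt-side signal is shown.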
