Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs

Yifan Zhou, Sachin Grover, Mohamed El Mistiri, Kamalesh Kalirathnam, Pratyush Kerhalkar, Swaroop Mishra, Neelesh Kumar, Sanket Gaurav, Oya Aran, Heni Ben Amor

TL;DR

Prompted Policy Search (ProPS) positions a large language model at the core of reinforcement learning optimization, fusing numeric reward signals with natural language guidance to propose policy updates in-context. The approach includes a numerics-only variant and a semantically-augmented variant (ProPS$^+$) that injects domain knowledge and hints, improving sample efficiency and interpretability. Empirical validation across 15 Gymnasium tasks shows ProPS and especially ProPS$^+$ achieve strong performance relative to seven standard RL baselines, with notable gains when semantic information is available; however, semantic prompts can bias learning in stochastic environments. The work also demonstrates robustness across multiple LLMs and shows that in-context history and lightweight fine-tuning can further enhance performance, signaling a potential shift toward human-aligned, transparent RL optimizers driven by language models.

Abstract

Reinforcement Learning (RL) traditionally relies on scalar reward signals, limiting its ability to leverage the rich semantic knowledge often available in real-world tasks. In contrast, humans learn efficiently by combining numerical feedback with language, prior knowledge, and common sense. We introduce Prompted Policy Search (ProPS), a novel RL method that unifies numerical and linguistic reasoning within a single framework. Unlike prior work that augments existing RL components with language, ProPS places a large language model (LLM) at the center of the policy optimization loop, directly proposing policy updates based on both reward feedback and natural language input. We show that LLMs can perform numerical optimization in-context, and that incorporating semantic signals, such as goals, domain knowledge, and strategy hints, can lead to more informed exploration and sample-efficient learning. ProPS is evaluated across fifteen Gymnasium tasks, spanning classic control, Atari games, and MuJoCo environments, and compared to seven widely adopted RL algorithms (e.g., PPO, SAC, TRPO). It outperforms all baselines on eight out of fifteen tasks and demonstrates substantial gains when provided with domain knowledge. These results highlight the potential of unifying semantics and numerics for transparent, generalizable, and human-aligned RL.
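The loop described in the abstract, in which an optimizer reads past parameters and rewards and proposes the next policy, can be illustrated with a minimal sketch. This is not the authors' implementation. All names here (`build_prompt`, `stub_proposer`, `toy_reward`, `policy_search`) are hypothetical, the LLM call is replaced by a random-perturbation stub, and a toy quadratic objective stands in for a Gymnasium rollout.

```python
import random
from typing import List, Tuple

# Each history entry pairs policy parameters with the reward they earned.
History = List[Tuple[List[float], float]]


def build_prompt(history: History, hints: str = "") -> str:
    """Render past (params, reward) pairs, plus optional semantic hints, as text."""
    header = "Propose new policy parameters that increase the reward."
    lines = [
        f"params: {[round(p, 3) for p in params]} reward: {reward:.3f}"
        for params, reward in history
    ]
    return "\n".join([header, hints, *lines]).strip()


def stub_proposer(prompt: str, history: History, rng: random.Random) -> List[float]:
    """Hypothetical stand-in for the LLM call.

    A real LLM would read `prompt`; this stub ignores the text and simply
    perturbs the best parameters seen so far (an assumption for illustration).
    """
    best_params, _ = max(history, key=lambda item: item[1])
    return [p + rng.gauss(0.0, 0.3) for p in best_params]


def toy_reward(params: List[float]) -> float:
    """Toy objective standing in for an environment rollout; peak at (1, -2)."""
    target = [1.0, -2.0]
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def policy_search(iterations: int = 200, seed: int = 0) -> History:
    """Run the propose-evaluate-record loop and return the full history."""
    rng = random.Random(seed)
    start = [0.0, 0.0]
    history: History = [(start, toy_reward(start))]
    for _ in range(iterations):
        # Domain knowledge or strategy hints (the ProPS+ setting) would be
        # passed through `hints`; it is left empty in this numerics-only sketch.
        prompt = build_prompt(history, hints="")
        candidate = stub_proposer(prompt, history, rng)
        history.append((candidate, toy_reward(candidate)))
    return history


if __name__ == "__main__":
    run = policy_search()
    best_params, best_reward = max(run, key=lambda item: item[1])
    print("best params:", best_params, "reward:", best_reward)
```

The point of the sketch is the data flow: the reward history is serialized into text, a proposer returns new parameters, and the evaluated result is appended to the history for the next round. Swapping the stub for an actual model call and the toy objective for episodic returns would be the step that makes it resemble the method the abstract describes.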

Paper Structure

This paper contains 59 sections, 3 equations, 21 figures, 9 tables.

Figures (21)

  • Figure 1: Overview of the approach used in ProPS, showing interactions between the environment, LLM, and RL.
  • Figure 2: Summary of the structure, information, and instructions of a task-agnostic ProPS prompt.
  • Figure 3: Summary of the structure, information, and instructions used to construct a domain-specific ProPS$^+$ prompt, with an example prompt for the CartPole environment.
  • Figure 4: Episodic performance of ProPS and ProPS$^+$ compared to baseline algorithms in the Swimmer task.
  • Figure 5: (a) Without semantic information, ProPS is able to learn a successful policy. Note that this policy avoids most of the possibilities of falling into a hole. By contrast, ProPS$^+$ is provided with task descriptions, but it produces a policy that works only when the environment is deterministic. (b) ProPS$^+$ reaches top performance in 8 out of 15 tasks in our empirical evaluations.
  • ...and 16 more figures