LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

Thomas Schmied, Jörg Bornschein, Jordi Grau-Moya, Markus Wulfmeier, Razvan Pascanu

TL;DR

This work investigates why LLMs falter in decision-making tasks and identifies greediness, frequency bias, and the knowing-doing gap as core failure modes. It introduces Reinforcement Learning Fine-Tuning (RLFT) on self-generated Chain-of-Thought rationales to improve exploration and action selection, evaluated across multi-armed bandits, contextual bandits, and Tic-tac-toe. Results show that RLFT enhances decision-making, reduces greediness, and narrows the knowing-doing gap, though exploration remains suboptimal compared to traditional bandit algorithms; combining RLFT with classic or LLM-specific exploration strategies yields further gains. The findings highlight the importance of CoT-based reasoning and reward shaping for steering LLMs toward more reliable, goal-aligned behavior in agentic settings, while outlining avenues for scaling and richer environments.

Abstract

The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as $\epsilon$-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
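As a point of reference for the classic exploration mechanism named in the abstract, the following is a minimal sketch of $\epsilon$-greedy action selection on a Gaussian multi-armed bandit. It is not the paper's implementation: the function name `epsilon_greedy_bandit`, its parameters, and the default values are hypothetical, and only the Python standard library is used.

```python
import random


def epsilon_greedy_bandit(means, steps=50, epsilon=0.1, sigma=1.0, seed=0):
    """Run epsilon-greedy on a Gaussian bandit (illustrative sketch only).

    means:   true mean reward of each arm (hypothetical environment).
    epsilon: probability of picking a uniformly random arm.
    Returns the list of chosen arms and the list of observed rewards.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms          # number of pulls per arm
    estimates = [0.0] * n_arms     # running mean reward per arm
    actions, rewards = [], []
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: uniform random arm
        else:
            best = max(estimates)        # exploit: best current estimate
            ties = [a for a in range(n_arms) if estimates[a] == best]
            arm = rng.choice(ties)       # break ties at random
        reward = rng.gauss(means[arm], sigma)
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        actions.append(arm)
        rewards.append(reward)
    return actions, rewards
```

With `epsilon=0` the agent is purely greedy, which is the behavior the abstract attributes to LLM agents; raising `epsilon` trades some immediate reward for broader exploration of the arms.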

Paper Structure

This paper contains 35 sections, 3 equations, 22 figures, 1 table.

Figures (22)

  • Figure 1: Illustration of our Reinforcement Learning Fine Tuning (RLFT) pipeline. We fine-tune a pre-trained LLM $\pi_{\theta}$ via self-generated Chain-of-Thought (CoT) rationales on environment rewards.
  • Figure 2: Illustration of a Gaussian MAB for the button scenario from Nie et al. (2024), using our context representation and instructions.
  • Figure 3: Illustration of Greediness. We show action coverage for Gemma2 2B/9B/27B w/ and w/o CoT for (a) 10 and (b) 20 arms over 50 interaction steps. Agents favor the best-performing action among the set of selected actions, leading to stagnating action coverage, despite benefits of larger models and CoT. In (c), we plot cumulative regret against action coverage. The agents exhibit suboptimal regret because of greedy action selection strategies.
  • Figure 4: Illustration of Frequency Bias. We plot the frequency of the repeated action in the context against the action entropy across all actions for 10-armed MABs. (a) Gemma2 2B heavily suffers from frequency bias, becoming more certain of the most frequent action the more often it occurs in the context. (c) Gemma2 27B overcomes the frequency bias, but instead behaves greedily. In (b) we show the action strategies for three repetition windows.
  • Figure 5: Confusion matrix for the Knowing-Doing Gap of Gemma2 27B. The agent "knows" how to solve the task (87% correct rationales, sum of top row), but fails at "doing" (58% greedy actions among correct rationales). See Figure \ref{fig:ucb-agent-knowing-doing} for instructions and an agent response.
  • ...and 17 more figures
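
The captions above refer to action coverage, cumulative regret, and action entropy without defining them in this excerpt. The sketch below shows one common way to compute each from a sequence of chosen arms. The definitions are assumptions rather than the paper's exact formulas: coverage is taken as the fraction of arms tried at least once, regret as the summed gap to the best arm's mean, and entropy as the Shannon entropy (in nats) of the empirical action distribution. All function names are hypothetical.

```python
import math
from collections import Counter


def action_coverage(actions, n_arms):
    """Fraction of the n_arms arms selected at least once (assumed definition)."""
    return len(set(actions)) / n_arms


def cumulative_regret(actions, means):
    """Running sum of (best mean - mean of chosen arm) after each step."""
    best = max(means)
    total, curve = 0.0, []
    for arm in actions:
        total += best - means[arm]
        curve.append(total)
    return curve


def action_entropy(actions):
    """Shannon entropy (nats) of the empirical distribution over chosen arms."""
    counts = Counter(actions)
    n = len(actions)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Under these assumed definitions, a greedy agent that settles on one arm early shows flat coverage and near-zero entropy, while its cumulative regret grows linearly whenever the chosen arm is not the best one.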