Assessing the Zero-Shot Capabilities of LLMs for Action Evaluation in RL

Eduardo Pignatelli; Johan Ferret; Tim Rocktäschel; Edward Grefenstette; Davide Paglieri; Samuel Coward; Laura Toni

TL;DR

This work addresses temporal credit assignment in reinforcement learning under sparse, delayed rewards by proposing Credit Assignment with Language Models (CALM), a framework that uses large language models to decompose tasks into subgoals and to assess subgoal achievement in state-action transitions. By treating the LLM as a shaping function and a subgoal verifier, CALM provides auxiliary rewards when options terminate, enabling zero-shot credit shaping without human-designed rewards or fine-tuning. The authors present a formal preliminaries section and an offline evaluation on the MiniHack/KeyRoom environment, showing that LLMs can understand goal specifications, verify termination, and suggest plausible subgoals, with performance influenced by observation type and model size. The results suggest that LLMs can serve as effective priors for credit assignment, potentially improving sample efficiency and transferring human knowledge into value functions, while highlighting limitations and avenues for online, multimodal extensions in future work.

Abstract

The temporal credit assignment problem is a central challenge in Reinforcement Learning (RL), concerned with attributing the appropriate influence to each action in a trajectory for its ability to achieve a goal. When feedback is delayed and sparse, the learning signal is poor and action evaluation becomes harder. Canonical solutions, such as reward shaping and options, require extensive domain knowledge and manual intervention, limiting their scalability and applicability. In this work, we lay the foundations for Credit Assignment with Language Models (CALM), a novel approach that leverages Large Language Models (LLMs) to automate credit assignment via reward shaping and options discovery. CALM uses LLMs to decompose a task into elementary subgoals and to assess the achievement of these subgoals in state-action transitions. Every time an option terminates, a subgoal is achieved, and CALM provides an auxiliary reward. This additional reward signal can enhance the learning process when the task reward is sparse and delayed, without the need for human-designed rewards. We provide a preliminary evaluation of CALM using a dataset of human-annotated demonstrations from MiniHack, suggesting that LLMs can be effective in assigning credit in zero-shot settings, without examples or LLM fine-tuning. Our preliminary results indicate that the knowledge of LLMs is a promising prior for credit assignment in RL, facilitating the transfer of human knowledge into value functions.
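The auxiliary-reward idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the bonus value, the subgoal strings, and the transition descriptions are all hypothetical, and a simple substring check stands in for the LLM call that would verify subgoal achievement.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: (transition description, subgoal) -> achieved?
# In a CALM-like setup this would be an LLM prompt; here it is a toy stand-in.
Verifier = Callable[[str, str], bool]


def keyword_verifier(transition: str, subgoal: str) -> bool:
    """Toy verifier (assumption): the subgoal counts as achieved if its text
    appears in the transition description."""
    return subgoal.lower() in transition.lower()


def shaped_rewards(
    transitions: Sequence[str],
    env_rewards: Sequence[float],
    subgoals: Sequence[str],
    verifier: Verifier,
    bonus: float = 0.5,
) -> List[float]:
    """Add an auxiliary bonus to the environment reward on every transition
    in which the verifier reports that some subgoal was achieved."""
    shaped = []
    for text, reward in zip(transitions, env_rewards):
        achieved = any(verifier(text, goal) for goal in subgoals)
        shaped.append(reward + (bonus if achieved else 0.0))
    return shaped


# Illustrative trajectory with a sparse, delayed task reward (made-up data).
transitions = [
    "You move north.",
    "You pick up the key.",
    "You unlock the door.",
    "You reach the goal.",
]
env_rewards = [0.0, 0.0, 0.0, 1.0]
subgoals = ["pick up the key", "unlock the door"]

shaped = shaped_rewards(transitions, env_rewards, subgoals, keyword_verifier)
print(shaped)
```

In this toy trajectory the task reward arrives only at the final step, while the shaped signal also rewards the two intermediate transitions that complete a subgoal. That denser signal is what the abstract argues can help when the task reward is sparse and delayed.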

Paper Structure (31 sections, 3 equations, 4 figures, 12 tables)

Introduction
Related work
for .
LLMs for reward shaping.
LLMs for knowledge transfer.
Preliminaries
Methods
Reward shaping
LLMs as shaping functions
Experimental protocol
Environment.
Dataset.
Composing the prompt.
Models.
Annotations.
...and 16 more sections

Figures (4)

Figure 1: F1 score as a function of model size.
Figure 2: Variation in F1 score between the baseline results presented in Tables \ref{tab:res:human-preset-balanced}-\ref{tab:res:crop-suggested-balanced} and the results without a token separator in Tables \ref{tab:app:ablation-token-preset-gamescreen}-\ref{tab:app:ablation-token-identify-cropped}. Yellow bars indicate worse performance without a separator, and blue bars indicate otherwise.
Figure 3: Tokenisation of the same prompt, with (\ref{fig:tokenisation-with-sep}) and without (\ref{fig:tokenisation-without-sep}) a token separator (whitespace).
Figure 4: Variation in F1 score between the baseline results presented in Tables \ref{tab:res:human-preset-balanced}-\ref{tab:res:crop-suggested-balanced} and the results where prompts also include the action in Tables \ref{tab:app:ablation-action-preset-gamescreen}-\ref{tab:app:ablation-action-identify-cropped}.
