OCALM: Object-Centric Assessment with Language Models

Timo Kaufmann, Jannis Blüml, Antonia Wüst, Quentin Delfosse, Kristian Kersting, Eyke Hüllermeier

TL;DR

Object-Centric Assessment with Language Models (OCALM) is proposed to derive inherently interpretable reward functions for RL agents from natural language task descriptions, providing RL agents with the ability to derive policies from task descriptions.

Abstract

Properly defining a reward signal to efficiently train a reinforcement learning (RL) agent is a challenging task. Designing balanced objective functions from which a desired behavior can emerge requires expert knowledge, especially for complex environments. Learning rewards from human feedback or using large language models (LLMs) to directly provide rewards are promising alternatives, allowing non-experts to specify goals for the agent. However, black-box reward models make it difficult to debug the reward. In this work, we propose Object-Centric Assessment with Language Models (OCALM) to derive inherently interpretable reward functions for RL agents from natural language task descriptions. OCALM uses the extensive world-knowledge of LLMs while leveraging the object-centric nature common to many environments to derive reward functions focused on relational concepts, providing RL agents with the ability to derive policies from task descriptions.
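To make the idea of an interpretable, relational reward function concrete, the following is a minimal sketch of what such a function might look like for a Pong-like game. It is not taken from the paper: the `GameObject` class, the category names `"Player"` and `"Ball"`, the helper functions, and the shaping rule are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class GameObject:
    """Hypothetical object-centric state entry: a category plus a position."""
    category: str
    x: float
    y: float


def distance(a: GameObject, b: GameObject) -> float:
    """Relational utility: Euclidean distance between two objects."""
    return math.hypot(a.x - b.x, a.y - b.y)


def find(objects, category):
    """Return the first object of the given category, or None if absent."""
    return next((o for o in objects if o.category == category), None)


def reward(prev_objects, objects) -> float:
    """Illustrative reward: positive when the player moved closer to the ball.

    Assumption: both states are lists of GameObject; the rule itself is a
    made-up example, not a reward function generated by OCALM.
    """
    player, ball = find(objects, "Player"), find(objects, "Ball")
    prev_player, prev_ball = find(prev_objects, "Player"), find(prev_objects, "Ball")
    if any(o is None for o in (player, ball, prev_player, prev_ball)):
        return 0.0  # no signal when a required object is not detected
    return distance(prev_player, prev_ball) - distance(player, ball)
```

Because the reward is ordinary Python built from named relational helpers, a reader can inspect and debug it directly, which is the property the abstract contrasts with black-box reward models.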

Paper Structure (17 sections, 1 equation, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Contrary to RL agents, humans infer objectives from context. The RL setting assumes the existence of an external reward function, whereas humans are able to infer rewards from information about the environment and task context.
  • Figure 2: Object-Centric Assessment with Language Models. OCALM extracts a neurosymbolic abstraction from the raw state, which is provided to a language model together with the game's context to generate a symbolic reward function (in Python). The language model first generates relational utility functions, which are then used in the reward function. This transparent reward can be inspected and used to train the policy.
  • Figure 3: OCALM generates meaningful reward functions that correlate with the intended game rewards. These figures show the performance of agents trained on OCALM-derived rewards, measured on both the OCALM-derived reward and the environment reward. The reward scales differ, so the axes are scaled to better visualize the correlation. Both plots for the same game share the same axis range for better comparability. The results indicate that (1) the reward functions generated by OCALM correspond to objectives learnable by an RL agent, and (2) the OCALM-derived rewards correlate with the environment rewards. All experiments were averaged over $3$ seeds, with standard deviations shown as shaded areas.
  • Figure 4: OCALM agents can master different Atari environments. The plots compare the performance of agents trained on OCALM-derived rewards with that of agents trained on the true game score. All experiments were averaged over $3$ seeds, with standard deviations shown as shaded areas.
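
The captions for Figures 3 and 4 mention averaging over seeds, standard deviations, and a correlation between derived and environment rewards. The sketch below shows one way such quantities could be computed. It assumes the sample standard deviation and the Pearson coefficient; the captions do not say which estimators the authors used, and the function names are hypothetical.

```python
import math
import statistics


def mean_and_std(per_seed_values):
    """Mean and sample standard deviation across seeds for one evaluation point.

    Assumption: sample (n - 1) standard deviation; the captions do not specify.
    """
    return statistics.mean(per_seed_values), statistics.stdev(per_seed_values)


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences.

    Assumption: Pearson is used here only as one common notion of correlation.
    Raises ZeroDivisionError if either sequence is constant.
    """
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

In use, `mean_and_std` would be applied per evaluation step to the returns of the three seeds (the mean for the curve, the standard deviation for the shaded band), and `pearson` to the two mean curves, one measured on the derived reward and one on the environment reward.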