Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning

Lynn Cherif; Flemming Kondrup; David Venuto; Ankit Anand; Doina Precup; Khimya Khetarpal

TL;DR

The paper tackles the challenge of RL in GUI-like, large-action, sparse-reward settings by reducing sample complexity through intent-based affordances. It introduces CoGA, a pipeline that uses pre-trained vision-language models to generate executable code that returns affordable actions, with a verification loop to ensure reliability before integrating the code into RL. Empirically, CoGA yields substantial sample-efficiency gains on MiniWob++ tasks, demonstrates generalization within related task families, and remains competitive with behavior cloning when only a small number of expert demonstrations are available. This approach offers a practical path to plug-and-play affordances into RL, potentially broadening the accessibility of RL to data-scarce, real-world GUI environments.

Abstract

Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboard actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through intent-based affordances, i.e., considering in any situation only the subset of actions that achieve a desired outcome. We propose Code as Generative Affordances (CoGA), a method that leverages pre-trained vision-language models (VLMs) to generate code that determines affordable actions through implicit intent-completion functions, using a fully automated program generation and verification pipeline. These programs are then used in the loop of a reinforcement learning agent to return a set of affordances given a pixel observation. By greatly reducing the number of actions that an agent must consider, we demonstrate on a wide range of tasks in the MiniWob++ benchmark that: 1) CoGA is orders of magnitude more sample efficient than its RL agent, 2) CoGA's programs can generalize within a family of tasks, and 3) CoGA performs better than or on par with behavior cloning when a small number of expert demonstrations is available.
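
To make the idea of constraining the action space concrete, the following is a minimal sketch of how a set of affordances could mask a policy's action distribution. It is an illustration under stated assumptions, not the paper's implementation: `affordance_fn` is a hypothetical hand-written stand-in for a VLM-generated program, the observation is a toy binary grid rather than pixels, and the fallback to the full action space when no action is affordable is our own choice.

```python
import math
import random


def affordance_fn(observation):
    """Hypothetical stand-in for a generated affordance program.

    Assumption: the observation is a toy grid (list of rows) in which a 1
    marks a clickable cell. Returns the set of affordable action indices,
    where action index = row * width + column.
    """
    width = len(observation[0])
    return {
        r * width + c
        for r, row in enumerate(observation)
        for c, value in enumerate(row)
        if value
    }


def masked_softmax(logits, allowed):
    """Softmax restricted to `allowed`; all other actions get probability 0."""
    if not allowed:
        # Assumption: with no affordable action, fall back to every action.
        allowed = set(range(len(logits)))
    peak = max(logits[i] for i in allowed)  # subtract the max for stability
    exps = [
        math.exp(logit - peak) if i in allowed else 0.0
        for i, logit in enumerate(logits)
    ]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, observation, rng):
    """Sample an action from the policy logits, masked by the affordances."""
    probs = masked_softmax(logits, affordance_fn(observation))
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

With a 2x3 grid in which only two cells are clickable, the agent samples from two actions instead of six; in a GUI with thousands of possible click locations, the same masking would shrink the search space far more sharply.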
