Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning

Lynn Cherif, Flemming Kondrup, David Venuto, Ankit Anand, Doina Precup, Khimya Khetarpal

TL;DR

The paper tackles the challenge of RL in GUI-like, large-action, sparse-reward settings by reducing sample complexity through intent-based affordances. It introduces CoGA, a pipeline that uses pre-trained vision-language models to generate executable code that returns affordable actions, with a verification loop to ensure reliability before integrating the code into RL. Empirically, CoGA yields substantial sample-efficiency gains on MiniWob++ tasks, demonstrates generalization within related task families, and remains competitive with behavior cloning when only a small number of expert demonstrations are available. This approach offers a practical path to plug-and-play affordances into RL, potentially broadening the accessibility of RL to data-scarce, real-world GUI environments.

Abstract

Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboard actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through intent-based affordances, i.e., considering in any situation only the subset of actions that achieve a desired outcome. We propose Code as Generative Affordances (CoGA), a method that leverages pre-trained vision-language models (VLMs) to generate code that determines affordable actions through implicit intent-completion functions and using a fully-automated program generation and verification pipeline. These programs are then used in-the-loop of a reinforcement learning agent to return a set of affordances given a pixel observation. By greatly reducing the number of actions that an agent must consider, we demonstrate on a wide range of tasks in the MiniWob++ benchmark that: 1) CoGA is orders of magnitude more sample efficient than its RL agent, 2) CoGA's programs can generalize within a family of tasks, and 3) CoGA performs better or on par compared with behavior cloning when a small number of expert demonstrations is available.
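To make the idea of constraining the action space concrete, here is a minimal sketch of affordance-based action masking. It is an illustration only, not the authors' implementation: the function names `mask_logits` and `masked_softmax`, the flat integer action indexing, and the fallback to the full action space when no affordance is returned are all assumptions made for this example.

```python
import numpy as np


def mask_logits(logits, affordable):
    """Keep logits only for affordable action indices; set the rest to -inf.

    `affordable` is a collection of integer action indices (hypothetical
    encoding). If it is empty, fall back to the unmasked logits (an
    assumption of this sketch, not something stated in the paper).
    """
    logits = np.asarray(logits, dtype=float)
    idx = np.fromiter(affordable, dtype=int)
    if idx.size == 0:
        return logits.copy()
    masked = np.full_like(logits, -np.inf)
    masked[idx] = logits[idx]
    return masked


def masked_softmax(logits, affordable):
    """Turn masked logits into a probability distribution over actions."""
    masked = mask_logits(logits, affordable)
    shifted = masked - np.max(masked)  # subtract max for numerical stability
    exp = np.exp(shifted)              # exp(-inf) is exactly 0
    return exp / exp.sum()


# Example: four possible actions, only actions 1 and 3 are affordable.
probs = masked_softmax([1.0, 2.0, 3.0, 4.0], {1, 3})
```

With this masking, non-affordable actions receive zero probability, so the agent only explores among the actions the affordance program returns.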

Paper Structure

This paper contains 41 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Left: Overview of our method, CoGA. The VLM processes available task descriptions and example observations to extract relevant intents (e.g., "click on a tab") and object template images (e.g., every tab), which are then used to generate code that returns a set of affordable actions given an observation. The code is validated and improved by a critique VLM. The set of affordances is then used to mask the action space of the RL agent. Right: Prompting pipeline to generate affordance scripts that return the set of affordable actions.
  • Figure 2: Left: Examples of returned affordances for three tasks (left to right): click-test, click-test-2, click-tab. Right: F1-score across tested tasks. We observe that most generated affordance scripts have a high F1-score, implying wide and precise coverage of ground truth affordances.
  • Figure 3: Left: Evaluation success rates at 1000 steps for the RL agent and CoGA across tasks. We observe that CoGA is over 10 times more sample efficient than the RL agent early in training at only 1000 steps. Right: Evaluation success rate curves for the RL agent and CoGA on count-sides (left) and click-test-2 (right).
  • Figure 4: Right: Evaluation success rates across tasks and expert data regimes of the BC agent, the RL agent, and CoGA (mean and standard deviation over 3 seeds). Left: Mean of evaluation success rates across tasks for the RL agent, CoGA, and increasing expert data regimes of the BC agent.
  • Figure 5: Example scripts across 2 successful tasks (click-tab, count-sides) and 2 unsuccessful tasks (use-slider, use-spinner).
  • ...and 1 more figure
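
The Figure 2 caption reports an F1-score between generated and ground-truth affordances. As a rough illustration of how such a score can be computed over sets of actions, the sketch below uses the standard set-based precision/recall definition. The function name `affordance_f1` and the representation of affordances as sets of hashable action identifiers are assumptions for this example; the paper's exact evaluation procedure is not given in this summary.

```python
def affordance_f1(predicted, ground_truth):
    """Set-based F1 between predicted and ground-truth affordable actions.

    Precision: fraction of predicted actions that are truly affordable.
    Recall: fraction of truly affordable actions that were predicted.
    Returns 0.0 when there is no overlap (including empty inputs).
    """
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


# Example: two of three predicted actions match two of three true ones.
score = affordance_f1({1, 2, 3}, {2, 3, 4})
```

A high score under this definition requires both few spurious actions (precision) and few missed ones (recall), which matches the caption's reading of "wide and precise coverage".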