MaestroMotif: Skill Design from Artificial Intelligence Feedback

Martin Klissarov, Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, Pierre-Luc Bacon, Doina Precup, Marlos C. Machado, Pierluca D'Oro

TL;DR

MaestroMotif addresses the challenge of injecting human knowledge into AI agents by enabling AI-assisted design of low-level skills from natural-language descriptions. It combines LLM-based reward design (via Motif), LLM-generated initiation/termination code, and a training-time policy over skills to learn robust sub-policies with RL, then deploys a code-generated policy that composes these skills without additional training. Evaluated on the NetHack Learning Environment, MaestroMotif achieves strong zero-shot performance across navigation, interaction, and composite tasks, outperforming baselines that rely on task-specific rewards or score maximization. The work highlights the potential of human-AI collaboration to automate complex policy design, leveraging the strengths of LLMs for abstraction and planning with RL for low-level control, and points to future directions in online adaptation and broader applicability.
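The skill machinery summarized above uses options-style notation: each skill $\omega$ has an initiation function $\mathcal{I}_\omega$, a termination function $\beta_\omega$, and a policy $\pi_\omega$. A policy over skills picks a new skill whenever the active one terminates. The Python sketch below shows that control loop under stated assumptions. The `Skill` class, `available`, `run_episode`, the dict-valued observations, and the toy "explore"/"descend" environment are all hypothetical illustrations, not the paper's code. The skill policies are fixed stubs rather than trained networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical observation type: a plain dict of game features.
Obs = Dict[str, int]


@dataclass
class Skill:
    """An option-style skill (illustrative, not the paper's implementation)."""
    name: str
    can_start: Callable[[Obs], bool]    # initiation function I_omega
    should_stop: Callable[[Obs], bool]  # termination function beta_omega
    act: Callable[[Obs], str]           # skill policy pi_omega (a stub here)


def available(skills: List[Skill], obs: Obs) -> List[int]:
    """Indices of skills whose initiation function holds in `obs`."""
    return [i for i, s in enumerate(skills) if s.can_start(obs)]


def run_episode(
    skills: List[Skill],
    select_skill: Callable[[Obs, List[int]], int],
    env_step: Callable[[Obs, str], Obs],
    obs: Obs,
    max_steps: int = 50,
) -> Tuple[List[str], Obs]:
    """Run skills; when the active one terminates, the policy over skills picks another."""
    trace: List[str] = []
    active: Optional[int] = None
    for _ in range(max_steps):
        if active is None or skills[active].should_stop(obs):
            choices = available(skills, obs)
            if not choices:
                break  # no skill can be initiated
            active = select_skill(obs, choices)
            trace.append(skills[active].name)
        obs = env_step(obs, skills[active].act(obs))
    return trace, obs


# Toy usage (hypothetical environment): "search" explores the level, "down" descends.
def toy_step(obs: Obs, action: str) -> Obs:
    if action == "search":
        return {**obs, "explored": obs["explored"] + 1}
    return {"depth": obs["depth"] + 1, "explored": 0}


skills = [
    Skill("explore", lambda o: o["explored"] < 3, lambda o: o["explored"] >= 3, lambda o: "search"),
    Skill("descend", lambda o: o["explored"] >= 3, lambda o: o["explored"] == 0, lambda o: "down"),
]
trace, final_obs = run_episode(
    skills, lambda obs, choices: choices[0], toy_step, {"depth": 0, "explored": 0}, max_steps=8
)
```

Here the policy over skills simply takes the first available skill, so the trace alternates between the two skills. In MaestroMotif that role is played by an LLM-generated policy over skills.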

Abstract

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.
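The reward-design step builds on Motif, which distills an LLM's pairwise preferences into a reward model. It does so by fitting $P(a \succ b) = \sigma(r(a) - r(b))$, a Bradley-Terry model trained with cross-entropy. The sketch below is a minimal, hypothetical version of that objective. It assumes each piece of interaction data is already summarized as a small feature vector and fits a linear reward by plain gradient descent. The actual method trains a neural reward network on a dataset of LLM-annotated interactions. The feature names and preference data are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

Features = Sequence[float]


def reward(w: Sequence[float], x: Features) -> float:
    """Linear reward r(x) = w . x (a stand-in for a neural reward model)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_reward(
    prefs: List[Tuple[Features, Features, float]], dim: int, lr: float = 0.5, epochs: int = 200
) -> List[float]:
    """Fit w from (x_a, x_b, y) triples: y = 1.0 if the LLM preferred a, 0.0 if it preferred b.

    Encoding "no preference" as y = 0.5 is an assumption of this sketch.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for xa, xb, y in prefs:
            p = sigmoid(reward(w, xa) - reward(w, xb))  # Bradley-Terry P(a preferred over b)
            for k in range(dim):
                grad[k] += (p - y) * (xa[k] - xb[k])    # gradient of the cross-entropy loss
        w = [wk - lr * gk / len(prefs) for wk, gk in zip(w, grad)]
    return w


# Hypothetical features: [went_downstairs, picked_up_item], labelled for a "descend"-like skill.
prefs = [
    ([1.0, 0.0], [0.0, 1.0], 1.0),
    ([1.0, 1.0], [0.0, 1.0], 1.0),
    ([0.0, 1.0], [1.0, 0.0], 0.0),
]
w = fit_reward(prefs, dim=2)
```

The learned weights favour the hypothetical `went_downstairs` feature, so a skill policy trained to maximize this reward would be pushed toward descending.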

Paper Structure

This paper contains 24 sections, 3 equations, 25 figures, 3 tables.

Figures (25)

  • Figure 1: Performance across NLE task categories. Evaluated zero-shot, MaestroMotif largely outperforms existing methods, including those trained on each task.
  • Figure 2: AI-assisted Skill Design with MaestroMotif. 1. An agent designer provides skill descriptions, which are converted to reward functions $r_{\varphi_i}$ by training on the preferences of an LLM on a dataset of interactions. 2. The agent designer describes initiation and termination functions, $\mathcal{I}_{ \omega_{\{1,\dots,n\}} }$ and $\beta_{ \omega_{\{1,\dots,n\}} }$, to the LLM, which instantiates them by generating code. 3. The agent designer describes a train-time policy over skills $\pi_T$, which the LLM implements in code. 4. Each skill policy $\pi_{\omega_i}$ is trained to maximize its corresponding reward $r_{\varphi_i}$. Whenever a skill terminates (see open/closed circuit), the policy over skills chooses a new one from the set of available skills.
  • Figure 2: Description of the composite tasks and success rate of MaestroMotif and baselines. Using a code policy allows MaestroMotif to compose skills by applying sophisticated logic, requiring memory or reasoning over a higher-level time abstraction. This is impossible to achieve for a zero-shot LLM policy, and hard to learn via a single reward function, which explains the failures of the baselines.
  • Figure 3: Generation of policy over skills during deployment. The LLM takes a task description and a template as an input, and implements the code for the policy over skills as a skill selection function. Running the code yields a policy over skills that commands a skill neural network by sending the appropriate skill index. Initiation and termination functions, determining which skills can be activated and when a skill execution should terminate, are omitted from the diagram.
  • Figure 4: Simplified depiction of the early stages of a NetHack game, with significant areas (such as branches) and entities labeled.
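The deployment-time policy over skills is code that the LLM writes from a task description and a template; running it returns a skill index. The sketch below shows what such a generated selection function might look like for a hypothetical composite task ("reach depth 3, then return to depth 1"). The class name, method signature, skill indices, and observation keys are assumptions for illustration, not the paper's template. The `available` argument stands for the indices whose initiation functions currently hold.

```python
from typing import Dict, List

# Hypothetical skill indices; the real skill set and its ordering are not specified here.
EXPLORE, DESCEND, ASCEND = 0, 1, 2


class SkillSelector:
    """Example of a policy over skills an LLM might generate for a hypothetical
    composite task: reach depth 3, then return to depth 1."""

    def __init__(self, target_depth: int = 3, return_depth: int = 1) -> None:
        self.target_depth = target_depth
        self.return_depth = return_depth
        # Memory: the task phase cannot be read off a single observation.
        self.reached_target = False

    def select_skill(self, obs: Dict[str, int], available: List[int]) -> int:
        if obs["depth"] >= self.target_depth:
            self.reached_target = True
        if self.reached_target:
            want = ASCEND if obs["depth"] > self.return_depth else EXPLORE
        else:
            want = DESCEND
        if want in available:
            return want
        # Fall back when the preferred skill's initiation function is false.
        return EXPLORE if EXPLORE in available else available[0]


selector = SkillSelector()
for depth in [1, 2, 3, 2, 1]:
    # Selected indices, in order: 1, 1, 2, 2, 0
    print(depth, selector.select_skill({"depth": depth}, [EXPLORE, DESCEND, ASCEND]))
```

Because `reached_target` persists across calls, the same observation (depth 2) maps to `DESCEND` before the target is reached and to `ASCEND` afterwards. This is the kind of memory over a higher-level time abstraction that the composite-task caption refers to.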