CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs

Richard Bornemann, Pierluigi Vito Amadori, Antoine Cully

TL;DR

CODE-SHARP tackles open-ended skill discovery by introducing Skills as Hierarchical Reward Programs (SHARPs) and a continuously expanding, FM-driven skill archive. It combines two iterative FM processes—discovery (proposal/implementor/judge) and refinement (mutation proposals)—with a goal-conditioned agent trained solely on SHARP rewards, guided by a high-level FM planner that composes SHARPs into policies-in-code. In Craftax, CODE-SHARP yields an average of ~90 SHARPs per run and enables the planner to solve long-horizon benchmarks, outperforming pretrained and expert baselines by over 134% on average. Limitations include reliance on code-based environments for SHARP generation, and future directions point toward extending to non-code settings via learned reward models or natural-language feedback.

Abstract

Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the set of meaningful skills is not known a priori. While recent methods have shown promising results towards automating reward function design, they remain limited to refining rewards for pre-defined tasks. To address this limitation, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a novel framework leveraging Foundation Models (FM) to open-endedly expand and refine a hierarchical skill archive, structured as a directed graph of executable reward functions in code. We show that a goal-conditioned agent trained exclusively on the rewards generated by the discovered SHARP skills learns to solve increasingly long-horizon goals in the Craftax environment. When composed by a high-level FM-based planner, the discovered skills enable a single goal-conditioned agent to solve complex, long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over 134% on average. We will open-source our code and provide additional videos at https://sites.google.com/view/code-sharp/homepage.
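
To make the idea of a "directed graph of executable reward functions" more concrete, here is a minimal Python sketch. It is not the paper's SHARP implementation: the `Skill` and `SkillArchive` classes, the dictionary-based toy state, the skill names, and the sparse reward rule are all assumptions chosen for illustration, and the real Craftax state and SHARP structure may differ substantially.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical toy state: item name -> count. Not the real Craftax state.
State = Dict[str, int]


@dataclass
class Skill:
    """One node of the archive: a success check plus edges to prerequisite skills."""
    name: str
    success: Callable[[State], bool]
    prerequisites: List[str] = field(default_factory=list)


class SkillArchive:
    """Directed graph of skills; acyclic because prerequisites must already exist."""

    def __init__(self) -> None:
        self.skills: Dict[str, Skill] = {}

    def add(self, skill: Skill) -> None:
        missing = [p for p in skill.prerequisites if p not in self.skills]
        if missing:
            raise ValueError(f"unknown prerequisites: {missing}")
        self.skills[skill.name] = skill

    def active_skill(self, name: str, state: State) -> str:
        """Descend to the first prerequisite whose success check is not yet met."""
        for prereq in self.skills[name].prerequisites:
            if not self.skills[prereq].success(state):
                return self.active_skill(prereq, state)
        return name

    def reward(self, name: str, state: State, next_state: State) -> float:
        """Assumed rule: 1.0 when the active sub-skill flips from unmet to met."""
        active = self.skills[self.active_skill(name, state)]
        return float(not active.success(state) and active.success(next_state))


# Illustrative skills only; names and conditions are made up for this sketch.
archive = SkillArchive()
archive.add(Skill("collect_wood", lambda s: s.get("wood", 0) >= 1))
archive.add(Skill("collect_stone", lambda s: s.get("stone", 0) >= 1))
archive.add(
    Skill(
        "craft_stone_pickaxe",
        lambda s: s.get("stone_pickaxe", 0) >= 1,
        prerequisites=["collect_wood", "collect_stone"],
    )
)
```

In this sketch, requesting the reward for `craft_stone_pickaxe` from an empty inventory routes the agent to `collect_wood` first, which illustrates how a composed skill could reuse simpler skills already in the archive.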

Paper Structure (28 sections, 4 equations, 7 figures, 9 tables, 2 algorithms)

This paper contains 28 sections, 4 equations, 7 figures, 9 tables, 2 algorithms.

Figures (7)

  • Figure 1: CODE-SHARP consists of two FM-driven iterative processes: one discovers novel SHARP skills, the other refines SHARP skills already present in the skill archive. For discovery, a pipeline of an FM-based skill proposal generator, implementor, and judge generates and filters novel SHARP skills before environment evaluation. For refinement, an FM-based skill mutation generator and implementor produce skill mutation proposals, which are evaluated directly in the environment.
  • Figure 2: Pseudocode version of the SHARP skill for crafting a stone pickaxe.
  • Figure 3: Interconnected archive of discovered SHARP skills. CODE-SHARP continuously builds on existing SHARP skills in the archive to define novel, meaningful skills in line with the natural curriculum of Craftax. Initial skill discovery focuses on the Overworld before progressing to the Dungeon then the Mines and finally the Sewers.
  • Figure 4: (a) Average score achieved on each benchmark task. CODE-SHARP outperforms the zero-shot ReAct LLM agent, the agent pretrained on environment rewards, and the task experts. (b) Evolution of agent capabilities over the course of open-ended skill discovery. The policy planner utilises increasingly complex SHARP skills to define policies-in-code throughout training, resulting in large performance gains relative to the average baseline.
  • Figure 5: Absolute Benchmark Score vs Proposal Iterations
  • ...and 2 more figures