Efficient Language-instructed Skill Acquisition via Reward-Policy Co-Evolution

Changxin Huang, Yanbin Chang, Junfan Lin, Junyang Liang, Runhao Zeng, Jianqiang Li

TL;DR

The paper tackles the challenge of language-guided robotic skill learning by enabling reward functions and policies to co-evolve rather than relying on a single universal reward. It introduces ROSKA, a framework that couples LLM-driven reward evolution with policy evolution guided by Short-Cut Bayesian Optimization, while employing a dynamic reward population to iteratively refine task feedback. A partial inheritance strategy blends prior best policies with random initialization, and BO optimizes the fusion ratio to balance retained knowledge with new learning. Across six Isaac Gym tasks, ROSKA achieves an average normalized improvement of $95.3\%$ with reduced data usage (about $89\%$ of Eureka's data), outperforming sparse, human-designed, and prior LLM-based reward methods and demonstrating strong data efficiency and robustness in high-dimensional robotic control. These results suggest ROSKA significantly advances autonomous, language-instructed robotic learning with practical implications for real-world deployment.

Abstract

The ability to autonomously explore and resolve tasks with minimal human guidance is crucial for the self-development of embodied intelligence. Although reinforcement learning methods can largely ease human effort, it's challenging to design reward functions for real-world tasks, especially for high-dimensional robotic control, due to complex relationships among joints and tasks. Recent advancements in large language models (LLMs) enable automatic reward function design. However, existing approaches evaluate reward functions by re-training policies from scratch, which places an undue burden on the reward function by expecting it to be effective throughout the whole policy improvement process. We argue for a more practical strategy in robotic autonomy, focusing on refining existing policies with policy-dependent reward functions rather than a universal one. To this end, we propose a novel reward-policy co-evolution framework where the reward function and the learned policy benefit from each other's progressive on-the-fly improvements, resulting in more efficient and higher-performing skill acquisition. Specifically, the reward evolution process translates the robot's previous best reward function and descriptions of the tasks and environment into text inputs. These inputs are used to query LLMs to generate a dynamic number of reward function candidates, ensuring continuous improvement at each round of evolution. For policy evolution, our method generates new policy populations by hybridizing historically optimal and random policies. Through an improved Bayesian optimization, our approach efficiently and robustly identifies the most capable and plastic reward-policy combination, which then proceeds to the next round of co-evolution. Despite using less data, our approach demonstrates an average normalized improvement of 95.3% across various high-dimensional robotic skill learning tasks.
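The policy-evolution step described above, hybridizing a historically optimal policy with a random one and searching for a good mixing ratio, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the linear interpolation rule, the function names `fuse_policies` and `pick_fusion_ratio`, the toy scoring function, and the plain search over a fixed list of candidate ratios, which stands in for the Bayesian optimization the paper uses. It is not the authors' implementation.

```python
import random


def fuse_policies(best_params, random_params, ratio):
    """Blend inherited and freshly initialized parameters.

    Assumed rule (not taken from the paper): element-wise linear
    interpolation, where ratio=1.0 keeps the previous best policy
    and ratio=0.0 keeps the random initialization.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if len(best_params) != len(random_params):
        raise ValueError("parameter vectors must have the same length")
    return [ratio * b + (1.0 - ratio) * r
            for b, r in zip(best_params, random_params)]


def pick_fusion_ratio(best_params, random_params, score_fn, candidates):
    """Return (ratio, score) for the highest-scoring candidate ratio.

    This exhaustive search is a simple stand-in for Bayesian
    optimization; score_fn is a hypothetical evaluation of a policy.
    """
    scored = [
        (score_fn(fuse_policies(best_params, random_params, a)), a)
        for a in candidates
    ]
    best_score, best_ratio = max(scored)
    return best_ratio, best_score


if __name__ == "__main__":
    rng = random.Random(0)
    best = [1.0, -2.0, 0.5]                      # previous best policy (toy)
    fresh = [rng.gauss(0.0, 0.1) for _ in best]  # random initialization
    target = [0.8, -1.6, 0.4]                    # toy optimum for the demo

    def toy_score(params):
        # Higher is better: negative squared distance to the toy optimum.
        return -sum((p - t) ** 2 for p, t in zip(params, target))

    ratio, score = pick_fusion_ratio(
        best, fresh, toy_score, [0.0, 0.25, 0.5, 0.75, 1.0]
    )
    print(ratio, score)
```

In a real setting, each call to `score_fn` would mean training and evaluating a policy, which is why a sample-efficient search such as Bayesian optimization is preferable to trying every candidate ratio.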

Paper Structure

This paper contains 25 sections, 15 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of main differences between our method and Eureka.
  • Figure 2: Overview of the proposed reward-policy co-evolutionary framework, illustrating the iterative refinement of reward functions and policies through mutual feedback between a large language model (LLM), reinforcement learning (PPO), and Bayesian optimization, enabling efficient and effective skill acquisition.
  • Figure 3: Illustrations of the six robot tasks in our experiment: Ant, Humanoid, ShadowHand, AllegroHand, FrankaCabinet, and ShadowHandUpsideDown.
  • Figure 4: HNS comparison across six robotic tasks, demonstrating that our method consistently outperforms other methods, with substantial improvements across all tasks.
  • Figure 5: MTS comparison showing our method's steady improvement and higher scores over rounds, while Eureka struggles with stability. For details, see the appendix.