STRIDE: Automating Reward Design, Deep Reinforcement Learning Training and Feedback Optimization in Humanoid Robotics Locomotion

Zhenwei Wu, Jinxiong Lu, Yuxiao Chen, Yunxin Liu, Yueting Zhuang, Luhui Hu

TL;DR

STRIDE tackles the reward-design bottleneck in humanoid DRL by automating reward generation, training, and feedback within an agentic-engineering framework in which LLMs synthesize executable reward functions from environment code and task descriptions. The approach forms a closed loop: the LLM proposes a reward, DRL training evaluates it via a fitness metric, and textual feedback guides iterative refinement, all without task-specific prompts. Empirical results show STRIDE outperforming the state-of-the-art EUREKA framework across multiple terrains, achieving sprint-like locomotion and robust performance with automated reward design, including a human-init variant that enhances stability. The work demonstrates a scalable pathway to advance humanoid robotics and DRL workflows by integrating environment-grounded reward synthesis with gradient-free human guidance.

Abstract

Humanoid robotics presents significant challenges in artificial intelligence, requiring precise coordination and control of high-degree-of-freedom systems. Designing effective reward functions for deep reinforcement learning (DRL) in this domain remains a critical bottleneck, demanding extensive manual effort, domain expertise, and iterative refinement. To overcome these challenges, we introduce STRIDE, a novel framework built on agentic engineering to automate reward design, DRL training, and feedback optimization for humanoid robot locomotion tasks. By combining the structured principles of agentic engineering with large language models (LLMs) for code-writing, zero-shot generation, and in-context optimization, STRIDE generates, evaluates, and iteratively refines reward functions without relying on task-specific prompts or templates. Across diverse environments featuring humanoid robot morphologies, STRIDE outperforms the state-of-the-art reward design framework EUREKA, achieving an average improvement of around 250% in efficiency and task performance. Using STRIDE-generated rewards, simulated humanoid robots achieve sprint-level locomotion across complex terrains, highlighting its ability to advance DRL workflows and humanoid robotics research.
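
The generate, evaluate, refine loop described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than STRIDE's implementation: `Candidate` reduces a reward function to two weights, `propose_candidates` replaces the LLM with random perturbation, and `train_and_score` replaces DRL training and fitness evaluation with a toy closed-form score. Only the control flow (propose, score, keep the best, summarize as text) mirrors the description.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical reward candidate: weights on two illustrative reward terms."""
    w_velocity: float
    w_energy: float


def propose_candidates(best: Candidate, feedback: str,
                       rng: random.Random, k: int) -> List[Candidate]:
    """Stand-in for the LLM proposer. A real proposer would condition on
    `feedback`; this stub ignores it and perturbs the current best."""
    return [
        Candidate(best.w_velocity + rng.uniform(-0.5, 0.5),
                  best.w_energy + rng.uniform(-0.5, 0.5))
        for _ in range(k)
    ]


def train_and_score(c: Candidate) -> float:
    """Stand-in for DRL training plus fitness evaluation.
    Toy assumption: fitness peaks at w_velocity=1.0, w_energy=0.1."""
    return -((c.w_velocity - 1.0) ** 2 + (c.w_energy - 0.1) ** 2)


def make_feedback(c: Candidate, fitness: float) -> str:
    """Turn the evaluation result into text for the next proposal round."""
    return (f"fitness={fitness:.3f} with "
            f"w_velocity={c.w_velocity:.2f}, w_energy={c.w_energy:.2f}")


def reward_search(iterations: int = 5, samples: int = 4,
                  seed: int = 0) -> Tuple[Candidate, float, List[float]]:
    """Closed loop: propose candidates, score each, keep the best, repeat."""
    rng = random.Random(seed)
    best = Candidate(0.0, 0.0)
    best_fit = train_and_score(best)
    feedback = ""
    history = [best_fit]  # best fitness after each round
    for _ in range(iterations):
        for cand in propose_candidates(best, feedback, rng, samples):
            fit = train_and_score(cand)
            if fit > best_fit:
                best, best_fit = cand, fit
        feedback = make_feedback(best, best_fit)
        history.append(best_fit)
    return best, best_fit, history


if __name__ == "__main__":
    best, fit, history = reward_search()
    print(best, fit)
```

Because a round only ever replaces the incumbent with a strictly better candidate, the recorded best fitness never decreases. In a real pipeline each call to `train_and_score` would be a full training run, so the number of samples per round dominates the cost.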

Paper Structure

This paper contains 23 sections, 8 figures, 1 table, and 1 algorithm.

Figures (8)

  • Figure 1: STRIDE pipeline: The framework integrates environment code, task descriptions, LLMs, and reinforcement learning to automate reward generation and optimization for humanoid robot locomotion tasks.
  • Figure 2: STRIDE Agent Framework: Integrating Environment Code, Task Descriptions, LLMs, and DRL for Automated Reward Generation and Optimization in Humanoid Robot Locomotion.
  • Figure 3: Humanoid robots tested on different terrains.
  • Figure 4: Comparison of STRIDE and EUREKA on the flat terrain.
  • Figure 5: Comparison of STRIDE and EUREKA on the flat terrains.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Definition 3.1: Reward Design Problem (RDP), adapted from Singh et al. (2009)