
Reward-Zero: Language Embedding Driven Implicit Reward Mechanisms for Reinforcement Learning

Heng Zhang, Haddy Alchaer, Arash Ajoudani, Yu She

TL;DR

Reward-Zero serves as a simple yet sophisticated universal reward function that leverages language embeddings to produce a continuous, semantically aligned sense-of-completion signal for efficient RL training.

Abstract

We introduce Reward-Zero, a general-purpose implicit reward mechanism that transforms natural-language task descriptions into dense, semantically grounded progress signals for reinforcement learning (RL). Reward-Zero serves as a simple yet sophisticated universal reward function that leverages language embeddings for efficient RL training. By comparing the embedding of a task specification with embeddings derived from an agent's interaction experience, Reward-Zero produces a continuous, semantically aligned sense-of-completion signal. This reward supplements sparse or delayed environmental feedback without requiring task-specific engineering. When integrated into standard RL frameworks, it accelerates exploration, stabilizes training, and enhances generalization across diverse tasks. Empirically, agents trained with Reward-Zero converge faster and achieve higher final success rates than conventional methods such as PPO with common reward-shaping baselines, and on some complex tasks they succeed where hand-designed rewards fail. In addition, we develop a mini benchmark that evaluates how well language embeddings capture the sense of completion during task execution. These results highlight the promise of language-driven implicit reward functions as a practical path toward more sample-efficient, generalizable, and scalable RL for embodied agents. Code will be released after peer review.
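
The comparison of a task embedding with an experience embedding described above can be illustrated with a small sketch. This is not the authors' implementation: the use of cosine similarity, the rescaling to [0, 1], the additive combination with the environment reward, and the names `completion_signal`, `augmented_reward`, and `weight` are all assumptions made for illustration. The embeddings themselves would come from some language or vision-language encoder, which is left out here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def completion_signal(task_embedding: Sequence[float],
                      experience_embedding: Sequence[float]) -> float:
    """Hypothetical sense-of-completion score in [0, 1].

    Assumption: cosine similarity in [-1, 1] is rescaled linearly to [0, 1].
    """
    return 0.5 * (cosine_similarity(task_embedding, experience_embedding) + 1.0)


def augmented_reward(env_reward: float,
                     task_embedding: Sequence[float],
                     experience_embedding: Sequence[float],
                     weight: float = 0.1) -> float:
    """Assumed additive combination of sparse env reward and the dense signal."""
    return env_reward + weight * completion_signal(task_embedding,
                                                   experience_embedding)
```

In this sketch a sparse environment reward of zero still yields a small positive signal whenever the experience embedding points in roughly the same direction as the task embedding, which is the sense in which such a reward is dense.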

Paper Structure (19 sections, 5 equations, 9 figures, 2 tables)

This paper contains 19 sections, 5 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Conceptual comparison of human learning, traditional RL, and the proposed Reward-Zero. The left panel illustrates how human learning is intuitive and implicit, driven by visual matching and a generalized "sense of completion," without relying on explicit rewards, mathematical environment models, or exact object coordinates in the world frame. The center panel depicts traditional RL training a simple robotic policy, a rigid and explicit approach that typically requires a hand-crafted reward function, a precise environment model, extensive observations, and exact object coordinates. The right panel introduces Reward-Zero, a flexible, general-purpose, language-driven implicit reward mechanism that generates a continuous sense-of-completion signal by comparing task and experience embeddings. As shown, Reward-Zero aims to eliminate hand-crafted rewards by relying only on raw language embeddings, enhancing generalization across diverse tasks. "Zero" here signifies the absence of hand-crafted rewards and explicit reward engineering. It also marks the zeroth step toward more general, adaptable, and scalable RL that can learn from natural language descriptions and raw observations, much like humans do.
  • Figure 2: Example tasks and keyframes from the completion-sense mini benchmark. Each episode contains 2-4 annotated keyframes at known completion percentages (0%, 33%, 50%, 66%, 100%) extracted from successful ManiSkill [gu2023maniskill2] rollouts. The benchmark includes tasks with varying visual complexity, from large state changes (e.g., OpenCabinetDrawer) to fine-grained manipulations (e.g., PegInsertionSide).
  • Figure 3: CLIP potential $\Phi(s)$ vs. task completion (%) for each benchmark task (CLIP-direct, $\alpha=0.7$). OpenCabinetDrawer shows two episodes (solid/dashed) to illustrate consistency across initial configurations. All tasks now use four keyframes at 0%, 33%, 66%, and 100% completion. Tasks with large visual changes show strong monotonic trends; fine-manipulation tasks exhibit smaller potential ranges.
  • Figure 4: The AnymalC-Reach task involves a quadruped robot learning to navigate to a target location. The task requires the agent to understand spatial relationships and adapt its locomotion strategy accordingly. This environment serves as a challenging testbed for evaluating how effectively Reward-Zero guides learning through language-driven implicit rewards.
  • Figure 5: Performance comparison between the PPO baseline and our approach on the ANYmal-C Reach task. Solid lines represent the mean values over $2\text{M}$ training steps. Our method significantly outperforms the baseline in both task success rate and cumulative reward.
  • ...and 4 more figures
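
The Figure 3 caption describes a potential $\Phi(s)$ that is compared against annotated completion percentages, with monotonic trends and potential ranges as the quantities of interest. The sketch below shows one way such keyframe potentials could be checked. It is an assumption-laden illustration, not the benchmark's actual metric: the function names are hypothetical, and the shaping term $\gamma\Phi(s') - \Phi(s)$ is the standard potential-based shaping form, which the captions here do not confirm the paper uses.

```python
from typing import Sequence


def is_monotonic_non_decreasing(potentials: Sequence[float]) -> bool:
    """True if potentials never decrease as completion percentage increases.

    Assumes `potentials` is ordered by completion (e.g. 0%, 33%, 66%, 100%).
    """
    return all(later >= earlier
               for earlier, later in zip(potentials, potentials[1:]))


def potential_range(potentials: Sequence[float]) -> float:
    """Spread between the largest and smallest potential in an episode."""
    return max(potentials) - min(potentials)


def shaping_term(phi_s: float, phi_next: float, gamma: float = 0.99) -> float:
    """Standard potential-based shaping term: gamma * Phi(s') - Phi(s).

    Assumption: shown only because the caption calls Phi a potential.
    """
    return gamma * phi_next - phi_s
```

A task with large visual changes would be expected to pass the monotonicity check with a wide range, while a fine-manipulation task might pass with a much narrower one.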