Adaptive Reward Design for Reinforcement Learning

Minjae Kwon, Ingy ElSayed-Aly, Lu Feng

TL;DR

This work tackles RL with complex, formally specified tasks by leveraging co-safe LTL translated into a DFA and introducing an adaptive reward shaping framework. By defining a distance-to-acceptance $d_{\varphi}(q)$ and a task progression measure $\rho_{\varphi}(q,q')$, the authors design two base reward functions and an adaptive mechanism that updates progress signals during learning, aligning policy optimization with maximal task progression $b^*$. Empirical results across discrete and continuous domains show that adaptive progression and adaptive hybrid rewards typically achieve earlier convergence, higher expected return, and greater task completion rates than baselines, while remaining compatible with algorithms like DQN, DDQN, DDPG, PPO, and A2C. The approach provides a principled path to robust RL under uncertainty with formal task specifications, and offers practical guidance on hyperparameters and applicability to a range of environments. Overall, the paper demonstrates that dynamic, task-aware reward shaping can significantly improve RL performance in LTL-specified tasks.
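To make the two quantities above concrete, the following is a minimal sketch of how a distance-to-acceptance $d_{\varphi}(q)$ could be computed on a DFA with a backward breadth-first search, together with one plausible reading of the progression measure $\rho_{\varphi}(q,q')$ as the positive decrease in that distance. The automaton, its state names (`q0` to `q4`), the use of infinity for states that cannot reach acceptance, and the exact form of `progression` are assumptions made for illustration; the summary does not give the paper's precise definitions.

```python
from collections import deque
import math


def distance_to_acceptance(transitions, accepting):
    """Return {state: fewest DFA transitions needed to reach an accepting state}.

    `transitions` maps (state, label) -> next_state. States that cannot reach
    acceptance get math.inf (an assumption of this sketch).
    """
    predecessors = {}
    states = set(accepting)
    for (q, _label), q_next in transitions.items():
        states.update((q, q_next))
        predecessors.setdefault(q_next, set()).add(q)

    dist = {q: math.inf for q in states}
    queue = deque()
    for q in accepting:
        dist[q] = 0
        queue.append(q)

    # Backward BFS: walk edges in reverse, starting from the accepting states.
    while queue:
        q = queue.popleft()
        for p in predecessors.get(q, ()):
            if math.isinf(dist[p]):
                dist[p] = dist[q] + 1
                queue.append(p)
    return dist


def progression(dist, q, q_next):
    """Assumed progression measure: how much closer q_next is to acceptance."""
    if math.isinf(dist[q]) or math.isinf(dist[q_next]):
        return 0
    return max(0, dist[q] - dist[q_next])


# Hypothetical DFA: reach both `o` and `b` in either order; `y` leads to a trap.
TRANSITIONS = {
    ("q0", "o"): "q1", ("q0", "b"): "q2", ("q0", "y"): "q4",
    ("q1", "b"): "q3", ("q1", "y"): "q4",
    ("q2", "o"): "q3", ("q2", "y"): "q4",
}
DIST = distance_to_acceptance(TRANSITIONS, accepting={"q3"})
```

Under these assumptions, moving from `q0` to `q1` yields a progression of 1, while falling into the trap state `q4` yields 0. An adaptive scheme of the kind the summary describes would revise the values in `DIST` during learning rather than keep them fixed.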

Abstract

There is a surge of interest in using formal languages such as Linear Temporal Logic (LTL) to precisely and succinctly specify complex tasks and derive reward functions for Reinforcement Learning (RL). However, existing methods often assign sparse rewards (e.g., giving a reward of 1 only if a task is completed and 0 otherwise). By providing feedback solely upon task completion, these methods fail to encourage successful subtask completion. This is particularly problematic in environments with inherent uncertainty, where task completion may be unreliable despite progress on intermediate goals. To address this limitation, we propose a suite of reward functions that incentivize an RL agent to complete a task specified by an LTL formula as much as possible, and develop an adaptive reward shaping approach that dynamically updates reward functions during the learning process. Experimental results on a range of benchmark RL environments demonstrate that the proposed approach generally outperforms baselines, achieving earlier convergence to a better policy with higher expected return and task completion rate.
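The contrast between a sparse reward and a progress-based reward can be sketched on a trace of automaton states. The distance values in `DIST`, the state names, and the `progression_return` formula (summing positive decreases in distance along the trace) are hypothetical choices for illustration, not the paper's reward functions.

```python
# Hypothetical distance-to-acceptance values for a four-state task automaton.
DIST = {"q0": 2, "q1": 1, "q2": 1, "q3": 0}
ACCEPTING = {"q3"}


def sparse_return(trace):
    """Reward of 1 only if the episode ends in an accepting state, else 0."""
    return 1.0 if trace[-1] in ACCEPTING else 0.0


def progression_return(trace):
    """Sum of per-step decreases in distance-to-acceptance (assumed form)."""
    return float(sum(max(0, DIST[q] - DIST[q_next])
                     for q, q_next in zip(trace, trace[1:])))


# An episode that completes one subtask but never finishes the task.
partial = ["q0", "q0", "q1", "q1"]
# An episode that completes the whole task.
complete = ["q0", "q1", "q3"]
```

Here `sparse_return(partial)` is 0.0 even though one subtask was completed, whereas `progression_return(partial)` is 1.0. For `complete`, the sparse return is 1.0 and the progression return is 2.0, so the progress-based signal distinguishes partial success from no success at all.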

Paper Structure

This paper contains 17 sections, 4 theorems, 14 equations, 8 figures.

Key Result

Theorem 1

Given an episodic MDP ${\mathcal{M}}$ and a DFA ${\mathcal{A}}_\varphi$ corresponding to a co-safe LTL formula $\varphi$, there exists an optimal policy $\pi^*$ of the product MDP ${\mathcal{M}}^\otimes = {\mathcal{M}} \otimes {\mathcal{A}}_\varphi$ that maximizes the expected return based on a rewa

Figures (8)

  • Figure 1: Example gridworld and a DFA ${\mathcal{A}}_\varphi$ for a co-safe LTL formula $\varphi = (\neg y) {\mathsf{U}} ((o {\wedge} ((\neg y) {\mathsf{U}} b)) {\vee} (b {\wedge} ((\neg y) {\mathsf{U}} o)))$.
  • Figure 2: Results for deterministic environments.
  • Figure 3: Results for noisy environments.
  • Figure 4: Results for infeasible environments.
  • Figure 5: Results of the ablation study on the sensitivity of hyperparameters $\theta$ and $N$ for updating distance-to-acceptance values in infeasible environments.
  • ...and 3 more figures

Theorems & Definitions (7)

  • Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • proof