Adaptive Reward Design for Reinforcement Learning
Minjae Kwon, Ingy ElSayed-Aly, Lu Feng
TL;DR
This work addresses RL for complex, formally specified tasks by translating co-safe LTL specifications into a DFA and introducing an adaptive reward shaping framework. Using a distance-to-acceptance $d_{\varphi}(q)$ and a task progression measure $\rho_{\varphi}(q,q')$, the authors design two base reward functions and an adaptive mechanism that updates progress signals during learning, aligning policy optimization with maximal task progression $b^*$. Across discrete and continuous domains, the adaptive progression and adaptive hybrid rewards typically converge earlier and achieve higher expected return and task completion rates than baselines, while remaining compatible with algorithms such as DQN, DDQN, DDPG, PPO, and A2C. The approach offers a principled route to RL under uncertainty with formal task specifications, along with practical guidance on hyperparameters and applicability to a range of environments.
Abstract
There is a surge of interest in using formal languages such as Linear Temporal Logic (LTL) to precisely and succinctly specify complex tasks and derive reward functions for Reinforcement Learning (RL). However, existing methods often assign sparse rewards (e.g., giving a reward of 1 only if a task is completed and 0 otherwise). By providing feedback solely upon task completion, these methods fail to encourage successful subtask completion. This is particularly problematic in environments with inherent uncertainty, where task completion may be unreliable despite progress on intermediate goals. To address this limitation, we propose a suite of reward functions that incentivize an RL agent to complete a task specified by an LTL formula as much as possible, and develop an adaptive reward shaping approach that dynamically updates reward functions during the learning process. Experimental results on a range of benchmark RL environments demonstrate that the proposed approach generally outperforms baselines, achieving earlier convergence to a better policy with higher expected return and task completion rate.
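To make the contrast between sparse and progress-based rewards concrete, the following minimal sketch computes a distance-to-acceptance over a small DFA and derives a per-transition progression signal from it. It is an illustration under stated assumptions, not the authors' implementation: the three-state DFA (for a hypothetical task "reach a, then reach b"), the function names, and the specific choice of progression as the non-negative decrease in distance are all assumptions made here, and the exact definitions in the paper may differ.

```python
from collections import deque


def distance_to_acceptance(transitions, accepting):
    """Shortest number of DFA transitions from each state to an accepting state.

    transitions: dict mapping state -> {symbol: next_state}
    accepting:   set of accepting states
    States that cannot reach acceptance are absent from the result.
    """
    # Build reverse adjacency so we can search backwards from accepting states.
    preds = {}
    for q, edges in transitions.items():
        for q_next in edges.values():
            preds.setdefault(q_next, set()).add(q)

    dist = {q: 0 for q in accepting}
    queue = deque(accepting)
    while queue:
        q = queue.popleft()
        for p in preds.get(q, ()):
            if p not in dist:
                dist[p] = dist[q] + 1
                queue.append(p)
    return dist


def progression(dist, q, q_next):
    """Assumed progression signal: how much closer q_next is to acceptance than q."""
    inf = float("inf")
    d_q, d_next = dist.get(q, inf), dist.get(q_next, inf)
    if d_q == inf or d_next == inf:
        return 0.0  # no measurable progress if acceptance is unreachable
    return max(0.0, float(d_q - d_next))


def sparse_reward(q, q_next, accepting):
    """Baseline from the abstract: reward 1 only when the task is completed."""
    return 1.0 if q_next in accepting and q not in accepting else 0.0


# Hypothetical DFA for "eventually a, then eventually b".
transitions = {
    "q0": {"a": "q1", "b": "q0", "none": "q0"},
    "q1": {"a": "q1", "b": "q2", "none": "q1"},
    "q2": {"a": "q2", "b": "q2", "none": "q2"},
}
accepting = {"q2"}
dist = distance_to_acceptance(transitions, accepting)
```

In this toy example, the transition from `q0` to `q1` earns nothing under the sparse reward but a positive signal under the progression-based one, which is the kind of intermediate feedback the abstract argues is missing when rewards are given only on task completion.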
