Effective Reward Specification in Deep Reinforcement Learning
Julien Roy
TL;DR
This work addresses the persistent challenge of reward specification in deep reinforcement learning, where improper reward design leads to misaligned or inefficient agents. It surveys reward composition and reward modeling, then presents four contributions: Adversarial Soft Advantage Fitting (ASAF) for imitation without policy optimization, coordination-promoting policy regularization in multi-agent RL, a constrained RL framework for direct behavior specification, and goal-conditioned GFlowNets for controllable multi-objective molecular design. Together, these approaches show how combining demonstrations, auxiliary objectives, explicit constraints, and goal-conditioned generation can improve sample efficiency, alignment, and controllability in complex tasks. The results illustrate both the potential and the limitations of each paradigm, emphasizing that no universal solution exists and that task-specific tool selection is essential. The work highlights practical implications for real-world RL deployment in robotics, multi-agent systems, and drug design, where intuitive interfaces for specifying objectives and constraints can reduce engineering effort and improve safety and performance.
Abstract
In the last decade, deep reinforcement learning has evolved into a powerful tool for complex sequential decision-making problems. It combines deep learning's proficiency in processing rich input signals with the adaptability of reinforcement learning (RL) across diverse control tasks. At its core, an RL agent seeks to maximize its cumulative reward, which enables AI algorithms to uncover solutions previously unknown to experts. However, this focus on reward maximization also introduces a significant difficulty: improper reward specification can result in unexpected, misaligned agent behavior and inefficient learning. The complexity of accurately specifying the reward function is further amplified by the sequential nature of the task, the sparsity of learning signals, and the multifaceted nature of the desired behavior. In this thesis, we survey the literature on effective reward specification strategies, identify core challenges relating to each of these approaches, and propose original contributions addressing the issues of sample efficiency and alignment in deep reinforcement learning. Reward specification represents one of the most challenging aspects of applying reinforcement learning in real-world domains. Our work underscores the absence of a universal solution to this complex and nuanced challenge; solving it requires selecting the most appropriate tools for the specific requirements of each unique application.
