Test-driven Reinforcement Learning in Continuous Control
Zhao Yu, Xiuping Wu, Liangjun Ke
TL;DR
This paper introduces Test-driven Reinforcement Learning (TdRL), a framework that replaces a handcrafted scalar reward with multiple test functions that define the task objective as trajectory-based criteria. Pass-fail tests specify what counts as optimal, while indicative tests guide learning. The paper proves that, under maximum entropy reinforcement learning, maximizing a trajectory return that assigns higher values to trajectories closer to the optimal trajectory set yields a policy closer to the optimal policy set, and it adopts a lexicographic heuristic to learn such a return function. The TdRL algorithm iteratively collects trajectories, learns a return function from indicative signals with stability-promoting losses, decomposes it into state-action rewards, and updates the policy. On DeepMind Control Suite tasks, TdRL matches or outperforms oracle-reward baselines while naturally handling multi-objective optimization. The work points to a practical, interpretable direction for reward design in RL, with theoretical grounding and applicability to multi-objective control tasks.
Abstract
Reinforcement learning (RL) has been recognized as a powerful tool for robot control tasks. RL typically employs a reward function to define the task objective and guide agent learning. However, because the reward function serves the dual purpose of defining the optimal goal and guiding learning, it is difficult to design manually, and the result is often a suboptimal task representation. To tackle this reward design challenge, and inspired by satisficing theory, we propose the Test-driven Reinforcement Learning (TdRL) framework. In TdRL, multiple test functions represent the task objective in place of a single reward function. Test functions fall into two categories: pass-fail tests, which define the optimal objective, and indicative tests, which guide the learning process. Separating these two roles makes tasks easier to define. Building upon this task definition, we first prove that if a trajectory return function assigns higher returns to trajectories closer to the optimal trajectory set, then maximum entropy policy optimization based on this return function yields a policy that is closer to the optimal policy set. We then introduce a lexicographic heuristic for comparing how far trajectories lie from the optimal trajectory set, which we use to learn the trajectory return function. Furthermore, we develop an algorithmic implementation of TdRL. Experimental results on the DeepMind Control Suite benchmark demonstrate that TdRL matches or outperforms handcrafted reward methods in policy training, with greater design simplicity and inherent support for multi-objective optimization. We argue that TdRL offers a novel perspective for representing task objectives, which could help address the reward design challenges in RL applications.
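To make the two kinds of test function concrete, the following is a minimal Python sketch under stated assumptions; it is not the authors' implementation. The trajectory format, the two example tests (a speed target and an effort limit), their thresholds, and the particular ordering key (number of passed tests first, total indicative shortfall second) are all hypothetical choices made for illustration. The sketch only shows how pass-fail tests can define success while indicative tests supply a graded signal, and how a lexicographic comparison can rank two trajectories by apparent closeness to the optimal set.

```python
from typing import List, Tuple

# Hypothetical trajectory format: one (speed, control_effort) pair per time step.
Trajectory = List[Tuple[float, float]]

SPEED_TARGET = 1.0   # assumed threshold, not taken from the paper
EFFORT_LIMIT = 0.5   # assumed threshold, not taken from the paper


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def mean_speed(traj: Trajectory) -> float:
    return mean([speed for speed, _ in traj])


def mean_effort(traj: Trajectory) -> float:
    return mean([abs(effort) for _, effort in traj])


# Pass-fail tests: define what counts as an acceptable trajectory.
def speed_pass(traj: Trajectory) -> bool:
    return mean_speed(traj) >= SPEED_TARGET


def effort_pass(traj: Trajectory) -> bool:
    return mean_effort(traj) <= EFFORT_LIMIT


# Indicative tests: graded signal, 0.0 when satisfied, negative shortfall otherwise.
def speed_indicator(traj: Trajectory) -> float:
    return min(mean_speed(traj) - SPEED_TARGET, 0.0)


def effort_indicator(traj: Trajectory) -> float:
    return min(EFFORT_LIMIT - mean_effort(traj), 0.0)


TESTS = [(speed_pass, speed_indicator), (effort_pass, effort_indicator)]


def closeness_key(traj: Trajectory) -> Tuple[int, float]:
    """Assumed lexicographic key: passed-test count, then total shortfall."""
    passed = sum(1 for pass_fail, _ in TESTS if pass_fail(traj))
    shortfall = sum(indicator(traj) for _, indicator in TESTS)
    return (passed, shortfall)


def closer_to_optimal(a: Trajectory, b: Trajectory) -> bool:
    """True if trajectory a ranks strictly above b under the assumed key."""
    # Python compares tuples lexicographically, element by element.
    return closeness_key(a) > closeness_key(b)


slow = [(0.2, 0.1), (0.4, 0.1)]     # fails the speed test by a wide margin
medium = [(0.6, 0.1), (0.8, 0.1)]   # fails the speed test by a smaller margin
fast = [(1.2, 0.3), (1.0, 0.3)]     # passes both tests
```

Pairwise preferences of this kind, such as `closer_to_optimal(medium, slow)`, are the sort of comparison signal from which a trajectory return function could then be fitted. How TdRL actually constructs and orders its comparisons is specified in the paper body, not here.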
