Transductive Reward Inference on Graph
Bohao Qu, Xiaofeng Cao, Qing Guo, Yi Chang, Ivor W. Tsang, Chengqi Zhang
TL;DR
The paper addresses offline reinforcement learning with scarce reward annotations by introducing TRAIN, a transductive reward inference method on a reward propagation graph. It models each state–action pair as a graph node and learns multi-factor edge weights that reflect how different state and action components influence rewards, then infers rewards for unlabelled pairs via a fixed-point propagation: $R_U = (I - W_{UU})^{-1} W_{UL} R_L$. The approach is validated on locomotion and robotic manipulation tasks from Meta-World and the DeepMind Control Suite, showing improved offline RL performance and reward smoothness. This work enables efficient policy learning in settings where reward functions are difficult to obtain, with practical impact for robotics, healthcare, and other ethically constrained domains.
Abstract
In this study, we present a transductive inference approach on a reward information propagation graph, which enables the effective estimation of rewards for unlabelled data in offline reinforcement learning. Reward inference is key to learning effective policies in practical scenarios, such as healthcare and robotics, where direct environmental interactions are either too costly or unethical and reward functions are rarely accessible. Our research focuses on developing a reward inference method, based on the contextual properties of information propagation on graphs, that capitalizes on a limited number of human reward annotations to infer rewards for unlabelled data. We leverage both the available data and the limited reward annotations to construct a reward propagation graph, in which the edge weights incorporate various factors that influence the rewards. We then use the constructed graph for transductive reward inference, thereby estimating rewards for unlabelled data. Furthermore, we establish the existence of a fixed point for the iterations of the transductive inference process and demonstrate that it converges to at least a local optimum. Empirical evaluations on locomotion and robotic manipulation tasks validate the effectiveness of our approach: applying our inferred rewards improves performance in offline reinforcement learning tasks.
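To make the propagation step concrete, the following is a minimal sketch of the closed form quoted in the TL;DR, $R_U = (I - W_{UU})^{-1} W_{UL} R_L$, compared against the iteration $R_U \leftarrow W_{UU} R_U + W_{UL} R_L$ whose fixed point it is. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian-kernel edge weights stand in for the learned multi-factor weights, the row normalization is assumed, and the function names (`build_graph`, `propagate_rewards`, `iterate_rewards`) and the toy one-dimensional features are hypothetical.

```python
import numpy as np


def build_graph(features, sigma=1.0):
    """Row-normalized Gaussian similarity graph (assumed stand-in for learned weights)."""
    diff = features[:, None, :] - features[None, :, :]
    W = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                  # no self-loops
    return W / W.sum(axis=1, keepdims=True)   # each row sums to 1


def split_blocks(W, labelled_idx):
    """Return the unlabelled indices and the W_UU, W_UL blocks of W."""
    labelled_idx = np.asarray(labelled_idx)
    unlabelled_idx = np.setdiff1d(np.arange(W.shape[0]), labelled_idx)
    W_UU = W[np.ix_(unlabelled_idx, unlabelled_idx)]
    W_UL = W[np.ix_(unlabelled_idx, labelled_idx)]
    return unlabelled_idx, W_UU, W_UL


def propagate_rewards(W, labelled_idx, r_labelled):
    """Closed form: solve (I - W_UU) R_U = W_UL R_L without forming the inverse."""
    unlabelled_idx, W_UU, W_UL = split_blocks(W, labelled_idx)
    identity = np.eye(len(unlabelled_idx))
    return unlabelled_idx, np.linalg.solve(identity - W_UU, W_UL @ r_labelled)


def iterate_rewards(W, labelled_idx, r_labelled, n_iters=500):
    """Iterative propagation R_U <- W_UU R_U + W_UL R_L, started from zeros."""
    unlabelled_idx, W_UU, W_UL = split_blocks(W, labelled_idx)
    r = np.zeros(len(unlabelled_idx))
    for _ in range(n_iters):
        r = W_UU @ r + W_UL @ r_labelled
    return unlabelled_idx, r


# Toy data: five state-action pairs embedded on a line; only the two ends are annotated.
features = np.arange(5, dtype=float).reshape(-1, 1)
labelled_idx = [0, 4]
r_labelled = np.array([0.0, 1.0])

W = build_graph(features)
idx_u, r_closed = propagate_rewards(W, labelled_idx, r_labelled)
_, r_iter = iterate_rewards(W, labelled_idx, r_labelled)
print(idx_u, r_closed)
```

In this toy setting every unlabelled node has some weight on a labelled node, so the rows of $W_{UU}$ sum to less than one and the iteration contracts to the same solution as the linear solve. Each inferred reward is then a weighted average of its neighbours' rewards and stays within the range of the annotated values.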
