Transductive Reward Inference on Graph

Bohao Qu, Xiaofeng Cao, Qing Guo, Yi Chang, Ivor W. Tsang, Chengqi Zhang

TL;DR

The paper addresses offline reinforcement learning with scarce reward annotations by introducing TRAIN, a transductive reward inference method on a reward propagation graph. It models each state–action pair as a graph node and learns multi-factor edge weights that reflect how different state and action components influence rewards, then infers rewards for unlabelled pairs via a fixed-point propagation: $R_U = (I - W_{UU})^{-1} W_{UL} R_L$. The approach is validated on locomotion and robotic manipulation tasks from Meta-World and the DeepMind Control Suite, showing improved offline RL performance and reward smoothness. This work enables efficient policy learning in settings where reward functions are difficult to obtain, with practical impact for robotics, healthcare, and other ethically constrained domains.
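
The closed-form propagation $R_U = (I - W_{UU})^{-1} W_{UL} R_L$ can be illustrated on a toy graph. The following is a minimal sketch, not the authors' implementation. It assumes uniform row-normalised edge weights on a four-node chain, whereas the paper learns multi-factor edge weights. The function names `propagate_closed_form` and `propagate_iteratively` are hypothetical.

```python
import numpy as np


def split_weights(W, labelled_idx, unlabelled_idx):
    """Extract the unlabelled-unlabelled and unlabelled-labelled blocks of W."""
    W_UU = W[np.ix_(unlabelled_idx, unlabelled_idx)]
    W_UL = W[np.ix_(unlabelled_idx, labelled_idx)]
    return W_UU, W_UL


def propagate_closed_form(W, labelled_idx, unlabelled_idx, R_L):
    """Solve (I - W_UU) R_U = W_UL R_L instead of forming the inverse explicitly."""
    W_UU, W_UL = split_weights(W, labelled_idx, unlabelled_idx)
    identity = np.eye(len(unlabelled_idx))
    return np.linalg.solve(identity - W_UU, W_UL @ R_L)


def propagate_iteratively(W, labelled_idx, unlabelled_idx, R_L, n_iters=200):
    """Repeat R_U <- W_UU R_U + W_UL R_L; converges when W_UU has spectral radius < 1."""
    W_UU, W_UL = split_weights(W, labelled_idx, unlabelled_idx)
    R_U = np.zeros(len(unlabelled_idx))
    for _ in range(n_iters):
        R_U = W_UU @ R_U + W_UL @ R_L
    return R_U


# Toy chain 0 - 1 - 2 - 3 (assumption: each node stands for one state-action pair).
adjacency = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=float,
)
# Assumption: uniform weights, row-normalised so each row sums to 1.
W = adjacency / adjacency.sum(axis=1, keepdims=True)

labelled = [0, 3]           # nodes with annotated rewards
unlabelled = [1, 2]         # nodes whose rewards are inferred
R_L = np.array([0.0, 1.0])  # annotated rewards for nodes 0 and 3

R_closed = propagate_closed_form(W, labelled, unlabelled, R_L)
R_iter = propagate_iteratively(W, labelled, unlabelled, R_L)
print(R_closed, R_iter)
```

On this chain the inferred rewards interpolate linearly between the two annotated endpoints (1/3 and 2/3), and the iterative update reaches the same values as the closed-form solve, which is what a fixed point of the propagation means here.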

Abstract

In this study, we present a transductive inference approach on a reward information propagation graph, which enables the effective estimation of rewards for unlabelled data in offline reinforcement learning. Reward inference is the key to learning effective policies in practical scenarios such as healthcare and robotics, where direct environmental interactions are either too costly or unethical and reward functions are rarely accessible. Our research focuses on developing a reward inference method, based on the contextual properties of information propagation on graphs, that uses a limited number of human reward annotations to infer rewards for unlabelled data. We leverage both the available data and the limited reward annotations to construct a reward propagation graph, in which the edge weights incorporate various factors that influence the rewards. We then use the constructed graph for transductive reward inference, thereby estimating rewards for unlabelled data. Furthermore, we establish that the iterative transductive inference process has a fixed point and show that it converges to at least a local optimum. Empirical evaluations on locomotion and robotic manipulation tasks validate the effectiveness of our approach. Using our inferred rewards improves performance in offline reinforcement learning tasks.

Paper Structure (33 sections, 19 equations, 10 figures, 9 tables)

This paper contains 33 sections, 19 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: We first represent each state-action pair within a Markov decision process (MDP) as an individual graph node. Then, we establish a foundation for modeling the state-action sequences across multiple MDPs as interconnected chains. Finally, these chains collectively form a comprehensive graph that encapsulates the dynamics of multiple MDPs. The graph structure is characterized by connectivity, where each node is connected to multiple other nodes. We leverage this feature to propagate reward-related information within the graph.
  • Figure 2: TRAIN workflow: The process begins with the construction of a reward propagation graph using a pre-recorded dataset. Subsequently, this graph, in conjunction with state-action pairs that have rewards, is utilized to infer rewards for state-action pairs that lack rewards. In the final step, all state-action pairs, both with and without inferred rewards, are integrated into the offline reinforcement learning process.
  • Figure 3: Meta-World is a set of robotic manipulation tasks.
  • Figure 4: DeepMind Control Suite is a set of popular continuous control environments with tasks of varying difficulty, including locomotion and simple object manipulation.
  • Figure 5: Learning curves on the four Meta-World tasks as measured on the success rate. The solid line and shaded regions represent the mean and standard deviation, respectively, across five seeds.
  • ...and 5 more figures

Theorems & Definitions (1)

  • proof