Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning

Yuting Tang, Xin-Qiang Cai, Jing-Cheng Pang, Qiyu Wu, Yao-Xiang Ding, Masashi Sugiyama

TL;DR

This paper introduces the problem of RL from Composite Delayed Reward (RLCoDe), which generalizes traditional RL from delayed rewards by dropping the strong assumption that the observed delayed reward is simply the sum of underlying Markovian instance-level rewards. It also presents a framework for modeling composite delayed rewards that uses a weighted sum of non-Markovian components to capture the different contributions of individual steps.

Abstract

Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals. Unfortunately, designing high-quality instance-level rewards often demands significant effort. An emerging alternative, RL with delayed reward, focuses on learning from rewards presented periodically, which can be obtained from human evaluators assessing the agent's performance over sequences of behaviors. However, traditional methods in this domain assume that underlying Markovian rewards exist and that the observed delayed reward is simply the sum of instance-level rewards; both assumptions often fail to match real-world scenarios. In this paper, we introduce the problem of RL from Composite Delayed Reward (RLCoDe), which generalizes traditional RL from delayed rewards by eliminating these strong assumptions. We suggest that the delayed reward may arise from a more complex structure reflecting the overall contribution of the sequence. To address this problem, we present a framework for modeling composite delayed rewards, using a weighted sum of non-Markovian components to capture the different contributions of individual steps. Building on this framework, we propose the Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism to model these contributions. We conduct experiments on challenging locomotion tasks where the agent receives delayed rewards computed from composite functions of observable step rewards. The experimental results indicate that CoDeTr consistently outperforms baseline methods across the evaluated metrics. Additionally, we demonstrate that it identifies the most significant time steps within the sequence and predicts rewards that closely reflect the environment feedback.
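
To make the experimental setting concrete, the following is a minimal sketch of how a delayed reward could be computed from observable step rewards. It assumes fixed-length, non-overlapping sequences and covers only the sum and max forms, whose meaning is unambiguous from their names; the function name `delayed_rewards` and its parameters are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence


def delayed_rewards(
    step_rewards: Sequence[float],
    delay: int,
    composite: Callable[[Sequence[float]], float],
) -> List[float]:
    """Return one composite reward per consecutive chunk of `delay` steps.

    Assumption: sequences are fixed-length and non-overlapping; the last
    chunk may be shorter if the trajectory length is not a multiple of `delay`.
    """
    if delay <= 0:
        raise ValueError("delay must be a positive integer")
    out: List[float] = []
    for start in range(0, len(step_rewards), delay):
        chunk = step_rewards[start:start + delay]
        out.append(composite(chunk))
    return out


# Hypothetical per-step rewards from an environment.
step_rewards = [1.0, -2.0, 3.0, 0.5, 4.0, -1.0]

# Sum form: the traditional delayed-reward assumption.
sum_form = delayed_rewards(step_rewards, 3, sum)

# Max form: only the best step in each sequence determines the feedback.
max_form = delayed_rewards(step_rewards, 3, max)
```

In the max form, most steps contribute nothing to the observed signal, which illustrates why a model that assumes a plain sum of per-step rewards can be a poor fit.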

Paper Structure

This paper contains 34 sections, 8 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of our framework. The Composite Delayed Reward Transformer generates the predicted non-Markovian rewards $\hat{r}_t$ along with the corresponding importance weights $w_t$ for each sequence. The final composite reward for each sequence is calculated as a weighted sum of the predicted rewards.
  • Figure 2: The architecture of the proposed Composite Delayed Reward Transformer. The model processes the sequence of state-action pairs using a causal transformer, where the embeddings $x$ represent the context information from the initial time step to the current time step. The in-sequence attention mechanism computes the non-Markovian rewards $\{ \hat{r} \}_{\tau}$, the queries $\{ \mathbf{q} \}_{\tau}$, and the keys $\{ \mathbf{k} \}_{\tau}$ in a sequence $\tau$. The query and key vectors are multiplied and passed through a softmax operation to compute attention weights. The attention-weighted instance-level rewards are then aggregated via sum pooling to generate the final sequence-level reward $\hat{R}_\mathrm{co}(\tau)$, which approximates $R_\mathrm{co}(\tau)$.
  • Figure 3: A performance comparison of composite delayed rewards, SumSquare (upper), SquareSum (middle), and Max (lower), across MuJoCo and DeepMind Control Suite environments with six different delay lengths (5, 25, 50, 100, 200, and 500). The normalized scores are averaged over 3 trials, with the mean and standard deviation computed across a total of 1e6 time steps.
  • Figure 4: Performance comparison of sum-form delayed rewards in MuJoCo and DeepMind Control Suite environments with three different delay lengths. The mean and standard deviation of the normalized scores are reported over 6 trials, spanning a total of 1e6 time steps.
  • Figure 5: Comparison of the mean of observed delayed rewards (blue line), predicted rewards (green line), and learned weights (red line) in the Ant environment under two different delayed reward structures: Sum (left) and Max (right). In the Max setting, the blue points indicate the steps with the highest rewards in the original environment. Every 25 steps form a delayed reward sequence, after which a composite delayed reward is assigned. The images below show the agent's behavior at the time steps highlighted by the black frames in the plots.
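
The aggregation described in the captions of Figures 1 and 2 can be sketched as follows. This is one plausible reading of the captions, not the authors' implementation: it assumes plain dot-product scores between every query and every key, a row-wise softmax, and no scaling or masking inside the sequence. The names `softmax` and `composite_reward` are hypothetical, and the queries, keys, and predicted rewards are passed in as plain lists rather than produced by a causal transformer.

```python
import math
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def composite_reward(
    queries: Sequence[Sequence[float]],
    keys: Sequence[Sequence[float]],
    rewards: Sequence[float],
) -> Tuple[float, List[float]]:
    """Return the predicted sequence-level reward and per-step weights.

    Assumed form: A = softmax(Q K^T) row-wise, then sum pooling of A @ r.
    Summing over rows is equivalent to weighting step j by the column
    sum w_j = sum_i A[i][j].
    """
    n = len(rewards)
    if not (len(queries) == len(keys) == n):
        raise ValueError("queries, keys, and rewards must have equal length")
    weights = [0.0] * n
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        for j, a in enumerate(softmax(scores)):
            weights[j] += a
    r_hat = sum(w * r for w, r in zip(weights, rewards))
    return r_hat, weights


# With all-zero queries the attention is uniform, so every step gets a
# weight of about 1 and the prediction reduces to the plain sum of rewards.
zero_q = [[0.0, 0.0] for _ in range(3)]
some_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
r_hat, weights = composite_reward(zero_q, some_k, [1.0, 2.0, 3.0])
```

Under these assumptions the sum-form delayed reward is a special case of the weighted sum, while non-uniform attention lets a few steps dominate, as in the Max setting of Figure 5.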

Theorems & Definitions (2)

  • Definition 2.1: MDP
  • Definition 2.2: CoDeMDP