A Multi-step Loss Function for Robust Learning of the Dynamics in Model-based Reinforcement Learning

Abdelhakim Benechehab, Albert Thomas, Giuseppe Paolo, Maurizio Filippone, Balázs Kégl

TL;DR

The paper tackles compounding errors in one-step dynamics learned for model-based RL by introducing a weighted multi-step loss that aggregates MSE across horizons with tunable weights and backpropagates through the full multi-step model. The authors demonstrate, in both tractable linear and nonlinear settings, that appropriate horizon weighting (often around α≈0.5) reduces variance and improves long-horizon predictive accuracy, especially under additive observation noise. Extensive experiments across Cartpole, Swimmer, and Halfcheetah datasets show improved static $R^2$ metrics and context-dependent gains in offline MBRL, while highlighting the strong influence of hyperparameters such as horizon h and decay β on performance. The work suggests that, in realistic noisy environments, carefully designed multi-step objectives can enhance the robustness and transferability of learned dynamics, with practical impact in real-world control tasks where measurement noise is inevitable.

Abstract

In model-based reinforcement learning, most algorithms rely on simulating trajectories from one-step models of the dynamics learned on data. A critical challenge of this approach is the compounding of one-step prediction errors as the length of the trajectory grows. In this paper we tackle this issue by using a multi-step objective to train one-step models. Our objective is a weighted sum of the mean squared error (MSE) loss at various future horizons. We find that this new loss is particularly useful when the data is noisy (additive Gaussian noise in the observations), which is often the case in real-life environments. To support the multi-step loss, first we study its properties in two tractable cases: i) uni-dimensional linear system, and ii) two-parameter non-linear system. Second, we show in a variety of tasks (environments or datasets) that the models learned with this loss achieve a significant improvement in terms of the averaged $R^2$-score on future prediction horizons. Finally, in the pure batch reinforcement learning setting, we demonstrate that one-step models serve as strong baselines when dynamics are deterministic, while multi-step models would be more advantageous in the presence of noise, highlighting the potential of our approach in real-world applications.
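The objective described in the abstract, a weighted sum of per-horizon MSEs computed by feeding a one-step model its own predictions, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `multi_step_loss` and `decay_weights`, the scalar linear toy system, and the normalized exponential weighting in `beta` are all hypothetical choices made for the example.

```python
import numpy as np


def decay_weights(h, beta):
    """Assumed weighting scheme: weights proportional to beta**i, normalized to sum to 1."""
    w = beta ** np.arange(h)
    return w / w.sum()


def multi_step_loss(step, states, actions, weights):
    """Weighted sum of per-horizon MSEs for a one-step model.

    step(s, a) -> predicted next state (hypothetical one-step model).
    states has shape (T + 1, d), actions has shape (T, k), weights has shape (h,).
    """
    h = len(weights)
    n_starts = len(states) - h      # start indices that have h future targets
    preds = states[:n_starts]       # s_t for every start index t
    total = 0.0
    for i in range(1, h + 1):
        # Roll the model forward on its own output:
        # s_hat_{t+i} = step(s_hat_{t+i-1}, a_{t+i-1})
        preds = step(preds, actions[i - 1 : i - 1 + n_starts])
        targets = states[i : i + n_starts]
        total += weights[i - 1] * np.mean((preds - targets) ** 2)
    return total


def make_step(a_coef, b_coef):
    """Hypothetical scalar linear one-step model s' = a_coef * s + b_coef * a."""
    return lambda s, a: a_coef * s + b_coef * a


# Toy noise-free trajectory from an assumed linear system.
rng = np.random.default_rng(0)
A_true, B_true, T = 0.9, 0.5, 50
actions = rng.normal(size=(T, 1))
states = np.zeros((T + 1, 1))
states[0] = 1.0
for t in range(T):
    states[t + 1] = A_true * states[t] + B_true * actions[t]

weights = decay_weights(h=4, beta=0.5)
loss_true = multi_step_loss(make_step(A_true, B_true), states, actions, weights)
loss_off = multi_step_loss(make_step(0.8, B_true), states, actions, weights)
print(loss_true, loss_off)
```

With a single weight (`weights = np.array([1.0])`) the function reduces to the usual one-step MSE. In an actual training loop the same rollout would be written in an automatic-differentiation framework so that gradients flow through every model call.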

Paper Structure

This paper contains 35 sections, 3 theorems, 16 equations, 17 figures, 5 tables, 1 algorithm.

Key Result

Proposition 5.3

($\alpha = 1$). Given a transition $(s_t \neq 0, o_{t+1})$ from the linear system and a linear model with parameter $\theta$, the minimizer of the $\alpha=1$ multi-step loss can be computed as:

Figures (17)

  • Figure 1: Schematic representation of the multi-step prediction framework using a one-step predictive model $\hat{p}$. The diagram illustrates the iterative prediction of future states $\hat{s}_{t+i}$, the computation of per-horizon losses $L_i$ against real system states $s_{t+i}$, and the weighting of these losses $\alpha_i$ to optimize the predictive model over a horizon of $h$ steps.
  • Figure 2: The loss function and its derivative for different values of $\theta$ and $\alpha$, in absence of noise ($\sigma=0$). In this figure, $\theta_{true}$ is fixed to a randomly selected value, $\theta_{true}=0.78$. The roots of the derivative are highlighted with stars.
  • Figure 3: The distance between the true parameters and the optimal parameters for different values of $\alpha$ and noise scales.
  • Figure 4: The left panel shows the density distribution of $\hat{\theta} - \theta_{true}$ for a fixed $\sigma$ of 1.0. The middle panel delineates the bias of the estimator, defined as $E[\hat{\theta}] - \theta_{true}$, across varying levels of $\sigma$, and weights $\alpha \in \{0, 0.5, 1\}$ indicated by color. The right panel presents the variance of the estimator, $Var[\hat{\theta}]$, as a function of $\sigma$ for the same set of $\alpha$ values. The shaded regions represent the 95% bootstrap confidence intervals across ten $\theta_{true}$ values and 100 Monte Carlo simulations.
  • Figure 5: The validation one-step MSE $L_1$ (in yellow), the validation two-step MSE $L_2$ (in green) and the average of these two MSEs (dashed black line) for different values of $\alpha$. The error bars represent the 95% bootstrap confidence intervals across 2 optimizers, 3 initialization distributions, 10 initial points, 3 noise levels, and 10 Monte Carlo simulations.
  • ...and 12 more figures

Theorems & Definitions (11)

  • Definition 3.1
  • Definition 3.2
  • Definition 5.1
  • Definition 5.2
  • Proposition 5.3
  • Proposition 5.4
  • Remark 5.5
  • Definition 5.6
  • Definition 5.7
  • Proposition 3.1
  • ...and 1 more