A Multi-step Loss Function for Robust Learning of the Dynamics in Model-based Reinforcement Learning
Abdelhakim Benechehab, Albert Thomas, Giuseppe Paolo, Maurizio Filippone, Balázs Kégl
TL;DR
The paper tackles compounding errors in one-step dynamics models learned for model-based RL by introducing a weighted multi-step loss that aggregates the MSE across prediction horizons with tunable weights and backpropagates through the full multi-step model. The authors demonstrate, in tractable linear and nonlinear settings, that appropriate horizon weighting (often around α≈0.5) reduces variance and improves long-horizon predictive accuracy, especially under additive observation noise. Extensive experiments on Cartpole, Swimmer, and Halfcheetah datasets show improved static R² metrics and context-dependent gains in offline MBRL, while highlighting the strong influence of hyperparameters such as the horizon h and the decay β on performance. The work suggests that, in realistic noisy environments, carefully designed multi-step objectives can enhance the robustness and transferability of learned dynamics, with practical impact in real-world control tasks where measurement noise is inevitable.
Abstract
In model-based reinforcement learning, most algorithms rely on simulating trajectories from one-step models of the dynamics learned on data. A critical challenge of this approach is the compounding of one-step prediction errors as the length of the trajectory grows. In this paper, we tackle this issue by using a multi-step objective to train one-step models. Our objective is a weighted sum of the mean squared error (MSE) loss at various future horizons. We find that this new loss is particularly useful when the data is noisy (additive Gaussian noise in the observations), which is often the case in real-life environments. To support the multi-step loss, we first study its properties in two tractable cases: i) a uni-dimensional linear system, and ii) a two-parameter non-linear system. Second, we show in a variety of tasks (environments or datasets) that the models learned with this loss achieve a significant improvement in terms of the averaged R²-score on future prediction horizons. Finally, in the pure batch reinforcement learning setting, we demonstrate that one-step models serve as strong baselines when dynamics are deterministic, while multi-step models would be more advantageous in the presence of noise, highlighting the potential of our approach in real-world applications.
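
The weighted multi-step objective described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the names `multi_step_loss` and `decaying_weights`, the array layout, and the choice of weights proportional to β^j (normalized to sum to one) are hypothetical and chosen only for the example. The sketch only evaluates the loss on one trajectory; training would additionally require differentiating through the recursive predictions.

```python
import numpy as np


def decaying_weights(h, beta):
    """Hypothetical weighting: weight on horizon j+1 is proportional to beta**j."""
    w = beta ** np.arange(h)
    return w / w.sum()


def multi_step_loss(model, states, actions, weights):
    """Weighted sum of MSE losses over horizons 1..h for a one-step model.

    model(s, a) -> predicted next state (any callable working on batches).
    states:  array of shape (T + 1, d), one observed trajectory.
    actions: array of shape (T, k), actions taken along that trajectory.
    weights: array of length h, one weight per prediction horizon.
    """
    h = len(weights)
    n_starts = len(actions) - h + 1      # start indices with h future targets
    preds = states[:n_starts]            # predictions begin at observed states
    total = 0.0
    for j in range(h):
        # Feed the model's own predictions back in to reach horizon j + 1.
        preds = model(preds, actions[j:j + n_starts])
        targets = states[j + 1:j + 1 + n_starts]
        total += weights[j] * np.mean((preds - targets) ** 2)
    return total


# Usage on a toy uni-dimensional linear system (assumed for illustration).
rng = np.random.default_rng(0)
T = 20
actions = rng.normal(size=(T, 1))
states = np.zeros((T + 1, 1))
states[0] = 1.0
for t in range(T):
    states[t + 1] = 0.9 * states[t] + 0.1 * actions[t]


def candidate_model(s, a):
    # A slightly misspecified one-step model of the toy system.
    return 0.8 * s + 0.1 * a


loss = multi_step_loss(candidate_model, states, actions, decaying_weights(3, 0.5))
```

With a single weight equal to one, the function reduces to the usual one-step MSE, so the one-step baseline is a special case of the same code path.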
