A Unified View on Solving Objective Mismatch in Model-Based Reinforcement Learning

Ran Wei, Nathan Lambert, Anthony McDonald, Alfredo Garcia, Roberto Calandra

TL;DR

This paper addresses objective mismatch in model-based RL by surveying approaches that align model learning with policy optimization. It introduces a taxonomy of decision-aware MBRL with four categories (Distribution Correction, Control-As-Inference, Value-Equivalence, and Differentiable Planning) and argues for a central principle, value optimization-equivalence, under which both model and policy are trained to maximize the real-world return $J_{M}(\pi) = \mathbb{E}_{P(\tau)}[R(\tau)]$. Synthesizing 46 papers, the work analyzes the intuition, implementation, and evaluation of each approach and discusses design choices, agent properties, optimization methods, and downstream applications. The review emphasizes implications for data efficiency, safety, transparency, and transfer, and outlines future directions including rigorous evaluation protocols and multi-task settings. Overall, training model and policy under a shared, principled decision-aware objective is presented as a path to more capable, robust, and interpretable MBRL systems.
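
To make the return $J_{M}(\pi) = \mathbb{E}_{P(\tau)}[R(\tau)]$ concrete, the following is a minimal sketch that computes it exactly for a toy tabular MDP. Everything in it is an illustrative assumption rather than something taken from the paper: the two-state MDP, the names `P`, `R`, `pi`, `mu0`, and `expected_return`, the finite horizon, and the choice of $R(\tau)$ as the (optionally discounted) sum of rewards along the trajectory.

```python
# Toy tabular MDP; all numbers below are illustrative assumptions.
# P[s][a] is the list of next-state probabilities, R[s][a] a scalar reward.
P = [
    [[0.9, 0.1], [0.2, 0.8]],   # from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],   # from state 1 under actions 0 and 1
]
R = [[0.0, 1.0], [0.0, 2.0]]
pi = [[0.5, 0.5], [0.0, 1.0]]   # pi[s][a]: probability of action a in state s
mu0 = [1.0, 0.0]                # initial state distribution


def expected_return(P, R, pi, mu0, horizon, gamma=1.0):
    """Exact J_M(pi) = E_{P(tau)}[R(tau)] over a finite horizon.

    Assumes R(tau) is the discounted sum of rewards along tau.
    """
    n_states = len(P)
    dist = list(mu0)            # state distribution at the current step
    total = 0.0
    for t in range(horizon):
        nxt = [0.0] * n_states
        for s in range(n_states):
            for a, p_a in enumerate(pi[s]):
                w = dist[s] * p_a               # probability of (s, a) at step t
                total += (gamma ** t) * w * R[s][a]
                for s2 in range(n_states):
                    nxt[s2] += w * P[s][a][s2]  # push mass to next states
        dist = nxt
    return total


J = expected_return(P, R, pi, mu0, horizon=2)
print(J)
```

Propagating the state distribution forward avoids enumerating trajectories. Passing a learned model in place of `P` gives the return the policy would obtain inside that model, which is the quantity a decision-aware method would compare against the real one.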

Abstract

Model-based Reinforcement Learning (MBRL) aims to make agents more sample-efficient, adaptive, and explainable by learning an explicit model of the environment. While the capabilities of MBRL agents have significantly improved in recent years, how to best learn the model is still an unresolved question. The majority of MBRL algorithms aim at training the model to make accurate predictions about the environment and subsequently using the model to determine the most rewarding actions. However, recent research has shown that model predictive accuracy is often not correlated with action quality, tracing the root cause to the objective mismatch between accurate dynamics model learning and policy optimization of rewards. A number of interrelated solution categories to the objective mismatch problem have emerged as MBRL continues to mature as a research area. In this work, we provide an in-depth survey of these solution categories and propose a taxonomy to foster future research.

Paper Structure

This paper contains 23 sections, 1 theorem, 42 equations, 1 figure, 1 algorithm.

Key Result

Theorem 2.1

(Lemma 3 in xu2020error) Given an MDP with bounded reward: $\max_{s, a}|R(s, a)| = R_{max}$ and dynamics $M$, a data-collecting behavior policy $\pi_{b}$, and a learned model $\hat{M}$ with $\mathbb{E}_{(s, a) \sim d_{M}^{\pi_{b}}}D_{KL}[M(\cdot|s, a) || \hat{M}(\cdot|s, a)] \leq \epsilon_{\hat{M}}$
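
The model-error condition in the theorem statement, $\mathbb{E}_{(s, a) \sim d_{M}^{\pi_{b}}}D_{KL}[M(\cdot|s, a) || \hat{M}(\cdot|s, a)] \leq \epsilon_{\hat{M}}$, can be evaluated directly when the dynamics are tabular. The sketch below is hedged accordingly: the tables `M`, `M_hat`, and `d` are made-up numbers, the occupancy `d` is supplied by hand rather than derived from a behavior policy, and the helper names `kl` and `expected_model_kl` are hypothetical.

```python
import math


def kl(p, q):
    """KL[p || q] for discrete distributions.

    Terms with p_i = 0 contribute 0. Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)


def expected_model_kl(d, M, M_hat):
    """E_{(s,a)~d} KL[M(.|s,a) || M_hat(.|s,a)] for tabular dynamics."""
    return sum(
        d[s][a] * kl(M[s][a], M_hat[s][a])
        for s in range(len(M))
        for a in range(len(M[s]))
    )


# Illustrative (assumed) true dynamics, learned dynamics, and occupancy.
M = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
M_hat = [
    [[0.8, 0.2], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
d = [[0.4, 0.1], [0.3, 0.2]]    # d[s][a], sums to 1 over all (s, a)

eps = expected_model_kl(d, M, M_hat)
print(eps)
```

Any value of $\epsilon_{\hat{M}}$ at least as large as `eps` satisfies the condition for this toy model. Because the expectation is taken under the behavior policy's occupancy, errors at state-action pairs that $\pi_{b}$ rarely visits contribute little to `eps`.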

Figures (1)

  • Figure 1: Schematic of the relationships between the core surveyed decision-aware MBRL approaches. Direct relationships are shown as solid arrows and indirect relationships as dashed connections. The algorithms in each category are sorted by the order in which they are presented in the paper.

Theorems & Definitions (1)

  • Theorem 2.1