Fixing That Free Lunch: When, Where, and Why Synthetic Data Fails in Model-Based Policy Optimization

Brett Barkley, David Fridovich-Keil

TL;DR

This paper analyzes when synthetic data helps Model-Based Policy Optimization (MBPO) and why it fails in the DeepMind Control Suite (DMC). It identifies two coupled failure modes: a scale mismatch between dynamics and reward targets that causes critic underestimation, and residual next-state targets that inflate model variance. It then introduces Fixing That Free Lunch (FTFL), which combines Target Unit Normalization with Direct Next-State Prediction to restore learning. FTFL outperforms SAC on five of seven DMC tasks while preserving MBPO's Gym performance, and Tuned FTFL improves results further by increasing model capacity. The work argues for taxonomies that map task and environment structure to algorithmic failure modes, and shows that benchmark choice shapes the conditions under which an algorithm appears to generalize.

Abstract

Synthetic data is a core component of data-efficient Dyna-style model-based reinforcement learning, yet it can also degrade performance. We study when it helps, where it fails, and why, and we show that addressing the resulting failure modes enables policy improvement that was previously unattainable. We focus on Model-Based Policy Optimization (MBPO), which performs actor and critic updates using synthetic action counterfactuals. Despite reports of strong and generalizable sample-efficiency gains in OpenAI Gym, recent work shows that MBPO often underperforms its model-free counterpart, Soft Actor-Critic (SAC), in the DeepMind Control Suite (DMC). Although both suites involve continuous control with proprioceptive robots, this shift leads to sharp performance losses across seven challenging DMC tasks, with MBPO failing in cases where claims of generalization from Gym would imply success. This reveals how environment-specific assumptions can become implicitly encoded into algorithm design when evaluation is limited. We identify two coupled issues behind these failures: scale mismatches between dynamics and reward models that induce critic underestimation and hinder policy improvement during model-policy coevolution, and a poor choice of target representation that inflates model variance and produces error-prone rollouts. Addressing these failure modes enables policy improvement where none was previously possible, allowing MBPO to outperform SAC in five of seven tasks while preserving the strong performance previously reported in OpenAI Gym. Rather than aiming only for incremental average gains, we hope our findings motivate the community to develop taxonomies that tie MDP task- and environment-level structure to algorithmic failure modes, pursue unified solutions where possible, and clarify how benchmark choices ultimately shape the conditions under which algorithms generalize.
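
The two issues named in the abstract, mismatched target scales and the choice of target representation, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that unit normalization means per-dimension z-scoring of the concatenated next-state and reward targets, which this section does not confirm, and every function name and data value below is hypothetical.

```python
import statistics


def build_targets(states, next_states, rewards, direct=True):
    """Build model targets of the form [state part, reward].

    direct=True  -> state part is the next state itself.
    direct=False -> state part is the residual (next_state - state).
    """
    targets = []
    for s, s_next, r in zip(states, next_states, rewards):
        if direct:
            state_part = list(s_next)
        else:
            state_part = [b - a for a, b in zip(s, s_next)]
        targets.append(state_part + [r])
    return targets


def fit_unit_normalizer(targets):
    """Per-dimension mean and standard deviation (assumed z-score scheme)."""
    columns = list(zip(*targets))
    means = [statistics.fmean(c) for c in columns]
    # Constant dimensions have zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds


def normalize(target, means, stds):
    """Map a raw target vector to zero mean and unit variance per dimension."""
    return [(t - m) / s for t, m, s in zip(target, means, stds)]


def denormalize(pred, means, stds):
    """Map a normalized model prediction back to environment units."""
    return [p * s + m for p, m, s in zip(pred, means, stds)]


# Hypothetical data: one state dimension spans hundreds, rewards span hundredths.
states = [[0.0, 100.0], [1.0, 110.0], [2.0, 120.0]]
next_states = [[1.0, 110.0], [2.0, 120.0], [3.0, 130.0]]
rewards = [0.01, 0.02, 0.03]

targets = build_targets(states, next_states, rewards, direct=True)
means, stds = fit_unit_normalizer(targets)
normalized = [normalize(t, means, stds) for t in targets]
```

After normalization, the reward dimension contributes to a squared-error or likelihood loss on the same scale as the state dimensions, so under these assumptions it would no longer be dominated by them. Predictions are mapped back with `denormalize` before they are used in rollouts.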

Paper Structure

This paper contains 21 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Per-environment interquartile mean (IQM) returns normalized to SAC in DeepMind Control Suite (DMC) and OpenAI Gym across 6 seeds per task. SAC is the baseline ($100\%$), with values above shaded green (improvement) and below shaded red (underperformance). MBPO matches or exceeds SAC only in Gym, failing on all DMC tasks. Fixing That Free Lunch (FTFL) and Tuned FTFL outperform SAC on all DMC tasks except hopper-hop and hopper-stand, with large gains in most cases. Task circle sizes are reduced left to right for visibility of overlaps.
  • Figure 2: Poor scaling between reward and next-state predictions in MBPO’s model leads to underestimated rewards, Q-value underestimation, and failure to improve beyond a random policy in humanoid-stand. (a) Real vs. synthetic rewards with and without proper model target scaling, using the best-performing SAC replay buffer (1 seed). (b) Critic estimates when MBPO is deployed online, 3 seeds (mean ± std). (c) Returns when MBPO is deployed online, 3 seeds (mean ± std).
  • Figure 3: (a) Variance loss comparison between walker2d and humanoid-stand, showing substantially higher variance in the latter (1 seed). (b) Predicted variance under residual versus direct next-state modeling, where direct prediction yields lower variance and a more stable fit (1 seed). (c) Online returns for three ablation settings on humanoid-stand, demonstrating that only combining target normalization with direct prediction enables consistent policy improvement (3 seeds, mean ± std).
  • Figure 4: Return performance on six DMC tasks. Following Agarwal et al. (2022), solid lines show interquartile mean (IQM) returns aggregated across six seeds, and shaded regions denote 95% bootstrapped confidence intervals. Results are shown for FTFL with the original model and for Tuned FTFL with increased model capacity (effective only on humanoid tasks). One result was omitted for brevity, but raw averaged returns are provided for all seven tasks in the raw-returns section.
  • Figure 5: Return performance on six OpenAI Gym tasks, using the same IQM and confidence interval conventions as Figure 4. Raw averaged returns are provided in the raw-returns section.
  • ...and 4 more figures
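
The captions above report interquartile mean (IQM) returns, normalized to SAC, with 95% bootstrapped confidence intervals. A minimal sketch of those three statistics follows. It assumes the IQM drops floor(n/4) values from each end of the sorted per-seed returns and uses a plain percentile bootstrap; the paper's exact evaluation code is not shown in this section, and the function names and return values are hypothetical.

```python
import random
import statistics


def iqm(values):
    """Interquartile mean: average of the values left after trimming each tail.

    Assumption: floor(n / 4) values are dropped from each end of the sorted
    list, so 6 seeds keep the middle 4.
    """
    ordered = sorted(values)
    cut = len(ordered) // 4
    return statistics.fmean(ordered[cut:len(ordered) - cut])


def sac_normalized_iqm(method_returns, sac_returns):
    """IQM of a method as a percentage of SAC's IQM (SAC itself is 100%)."""
    return 100.0 * iqm(method_returns) / iqm(sac_returns)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the IQM."""
    rng = random.Random(seed)
    stats = sorted(
        iqm(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical final returns for one task across 6 seeds.
sac_returns = [410.0, 395.0, 402.0, 388.0, 420.0, 150.0]
method_returns = [610.0, 580.0, 595.0, 640.0, 30.0, 602.0]

relative = sac_normalized_iqm(method_returns, sac_returns)
lower, upper = bootstrap_ci(method_returns)
```

Trimming the tails makes the aggregate less sensitive to a single collapsed seed (such as the 150.0 and 30.0 entries here) than a plain mean would be.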