Good Actions Succeed, Bad Actions Generalize: A Case Study on Why RL Generalizes Better

Meng Song

TL;DR

This work directly compares supervised learning (SL) and reinforcement learning (RL) on zero-shot generalization in the Habitat visual navigation task, evaluating PPO (RL) and BC (SL) in two settings: unseen start-goal pairs within seen environments, and unseen environments. PPO generalizes better in both settings, while BC excels at memorizing and reproducing shortest-path patterns when trained on ample expert data. The authors argue that RL generalizes by combinatorially stitching fragments of past trajectories, including failed ones, whereas BC generalizes by imitating successful trajectories. They show that additional optimal training data lets BC close the SPL gap but not the success-rate gap. They conclude with practical guidelines to enhance generalization for both paradigms and propose avenues for hybrid approaches that leverage the strengths of each.

Abstract

Supervised learning (SL) and reinforcement learning (RL) are both widely used to train general-purpose agents for complex tasks, yet their generalization capabilities and underlying mechanisms are not yet fully understood. In this paper, we provide a direct comparison between SL and RL in terms of zero-shot generalization. Using the Habitat visual navigation task as a testbed, we evaluate Proximal Policy Optimization (PPO) and Behavior Cloning (BC) agents across two levels of generalization: state-goal pair generalization within seen environments and generalization to unseen environments. Our experiments show that PPO consistently outperforms BC across both zero-shot settings and both performance metrics: success rate and SPL. Interestingly, even though additional optimal training data enables BC to match PPO's zero-shot performance in SPL, it still falls significantly behind in success rate. We attribute this to a fundamental difference in how models trained by these algorithms generalize: BC-trained models generalize by imitating successful trajectories, whereas TD-based RL-trained models generalize through combinatorial experience stitching, that is, by leveraging fragments of past trajectories (mostly failed ones) to construct solutions for new tasks. This allows RL to efficiently find solutions in vast state spaces and discover novel strategies beyond the scope of human knowledge. Besides providing empirical evidence and understanding, we also propose practical guidelines for improving the generalization capabilities of RL and SL through algorithm design.
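The abstract reports two metrics, success rate and SPL, without defining them. The sketch below assumes the definition of SPL (Success weighted by Path Length) commonly used in embodied-navigation evaluation: the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success indicator, l_i the shortest-path length, and p_i the length of the path the agent actually took. The function and argument names are hypothetical and are not taken from the paper or from Habitat.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Assumes every shortest-path length is strictly positive.
    """
    total = 0.0
    for succeeded, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if succeeded:
            # A failed episode contributes 0; a successful one is
            # discounted by how much longer than optimal the path was.
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Three hypothetical episodes: optimal success, detour success, failure.
successes = [True, True, False]
shortest_lengths = [10.0, 10.0, 10.0]
path_lengths = [10.0, 20.0, 5.0]

print(success_rate(successes))
print(spl(successes, shortest_lengths, path_lengths))
```

Because SPL multiplies by the success indicator, it can never exceed the success rate. Among the episodes an agent does solve, SPL rewards near-optimal paths, while success rate counts every solved episode equally regardless of path length.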

Paper Structure

This paper contains 19 sections, 11 equations, 8 figures.

Figures (8)

  • Figure 1: Trial-and-error data collection: The agent is commanded to reach $g_0$ but instead reaches $g_1$ either due to random exploration or the inability to reach $g_0$. Although these trajectories fail to accomplish the training tasks, they become useful for composing skills to solve unseen tasks.
  • Figure 2: Combinatorial generalization: The agent has visited the gray and beige paths separately during training but has never seen the red path, yet it can discover it after TD learning.
  • Figure 3: Feature generalization: The agent has been trained on a large set of optimal paths between different $(s_0,g)$ pairs. When presented with an unseen $(s_0,g)$, it infers the optimal path from $s_0$ to $g$ based on common features across training samples, such as the shape of optimal paths and frequently appearing decision-informative visual elements, etc.
  • Figure 4: PPO Architecture
  • Figure 5: BC Architecture
  • ...and 3 more figures
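
The stitching idea described in the captions of Figures 1 and 2 can be illustrated with a toy example. This is a minimal sketch, not the paper's method: it swaps in tabular Q-learning (a TD method) on a hand-made four-transition graph in place of PPO on Habitat, and the state names, action names, and reward scheme are all hypothetical. Two logged trajectories pass through a shared state `m`; neither goes from `s0` to `g`, yet TD backups over the pooled transitions recover that path.

```python
from collections import defaultdict

# Two logged trajectories as (state, action, next_state) transitions.
# "gray" was commanded elsewhere and ended at g1 (a failure for goal g).
gray = [("s0", "right", "m"), ("m", "up", "g1")]
# "beige" starts from a different state and reaches g.
beige = [("s1", "left", "m"), ("m", "down", "g")]


def q_learning_offline(transitions, goal, gamma=0.9, sweeps=50):
    """Tabular Q-learning over a fixed set of deterministic transitions."""
    q = defaultdict(float)
    actions = defaultdict(set)
    for state, action, _ in transitions:
        actions[state].add(action)
    for _ in range(sweeps):
        for state, action, next_state in transitions:
            reward = 1.0 if next_state == goal else 0.0
            if next_state == goal or not actions[next_state]:
                bootstrap = 0.0  # terminal: goal reached or dead end
            else:
                bootstrap = max(q[(next_state, a)] for a in actions[next_state])
            # Learning rate 1 is exact here because transitions are deterministic.
            q[(state, action)] = reward + gamma * bootstrap
    return q, actions


def greedy_path(q, actions, model, start, goal, max_steps=10):
    """Follow the greedy policy through the logged transition model."""
    path, state = [start], start
    for _ in range(max_steps):
        if state == goal or not actions[state]:
            break
        action = max(sorted(actions[state]), key=lambda a: q[(state, a)])
        state = model[(state, action)]
        path.append(state)
    return path


transitions = gray + beige
model = {(s, a): s2 for s, a, s2 in transitions}
q, actions = q_learning_offline(transitions, goal="g")
print(greedy_path(q, actions, model, start="s0", goal="g"))
```

The value of reaching `g` propagates backward through `m` to `s0`, even though the only trajectory containing `s0` never reached `g`. A behavior-cloning learner trained only on the successful trajectory would have no action label at `s0` for this goal. It would have to rely on feature similarity instead, which is the contrast drawn by Figure 3.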