SECANT: Self-Expert Cloning for Zero-Shot Generalization of Visual Policies

Linxi Fan, Guanzhi Wang, De-An Huang, Zhiding Yu, Li Fei-Fei, Yuke Zhu, Anima Anandkumar

TL;DR

The paper tackles zero-shot generalization in visual reinforcement learning by introducing SECANT, a two-stage self-expert cloning framework: an expert is first trained by RL with weak augmentations, and its behavior is then distilled into a student that learns from strongly augmented observations. This decouples policy optimization from robust representation learning, enabling generalization to unseen visual environments without test-time rewards or adaptation. Experiments across DMControl, Robosuite, CARLA, and iGibson show consistent improvements in average reward over prior SOTA, more robust learned representations, and inference roughly an order of magnitude faster than PAD. The work also offers insights on augmentation strategies, imitation design, and the benefits of sequential two-stage training for robust visual policies.

Abstract

Generalization has been a long-standing challenge for reinforcement learning (RL). Visual RL, in particular, can be easily distracted by irrelevant factors in high-dimensional observation space. In this work, we consider robust policy learning which targets zero-shot generalization to unseen visual environments with large distributional shift. We propose SECANT, a novel self-expert cloning technique that leverages image augmentation in two stages to decouple robust representation learning from policy optimization. Specifically, an expert policy is first trained by RL from scratch with weak augmentations. A student network then learns to mimic the expert policy by supervised learning with strong augmentations, making its representation more robust against visual variations compared to the expert. Extensive experiments demonstrate that SECANT significantly advances the state of the art in zero-shot generalization across 4 challenging domains. Our average reward improvements over prior SOTAs are: DeepMind Control (+26.5%), robotic manipulation (+337.8%), vision-based autonomous driving (+47.7%), and indoor object navigation (+15.8%). Code release and video are available at https://linxifan.github.io/secant-site/.
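
The second stage described above (a student imitating a frozen expert under strong augmentation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear `policy`, the use of Gaussian noise as the only strong augmentation, the mean-squared imitation loss, and the choice to query the expert on the unaugmented observation are all assumptions made to keep the example short. Stage 1 (training the expert by RL) is omitted, and `expert_w` simply stands in for its result.

```python
import numpy as np

def strong_augment(obs, rng, sigma=0.2):
    """Additive Gaussian noise, a stand-in for the strong augmentations."""
    noisy = obs + rng.normal(0.0, sigma, size=obs.shape)
    return np.clip(noisy, 0.0, 1.0)

def policy(weights, obs):
    """Hypothetical linear policy: flatten the image, map it to an action vector."""
    return obs.reshape(-1) @ weights

def distill_step(student_w, expert_w, obs, rng, lr=0.01, sigma=0.2):
    """One supervised update of the student toward the frozen expert's action."""
    # Assumption: the expert labels the clean observation.
    target = policy(expert_w, obs)
    # The student only ever sees the strongly augmented view.
    x = strong_augment(obs, rng, sigma).reshape(-1)
    err = x @ student_w - target
    loss = float(np.mean(err ** 2))
    # Gradient of the mean-squared error with respect to the student weights.
    grad = 2.0 * np.outer(x, err) / err.size
    return student_w - lr * grad, loss

rng = np.random.default_rng(0)
obs = rng.random((4, 4, 1))             # toy 4x4 grayscale observation in [0, 1]
expert_w = rng.normal(size=(16, 2))     # pretend output of stage 1
student_w = np.zeros((16, 2))
for _ in range(200):
    student_w, loss = distill_step(student_w, expert_w, obs, rng)
```

Only the student's weights are updated, and no reward signal appears anywhere in the step, which is the sense in which policy optimization is decoupled from robust representation learning.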

Paper Structure

This paper contains 22 sections, 3 equations, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: Our proposed benchmark for visual policy generalization in 4 diverse domains. Top to bottom: DMControl Suite (15 settings), CARLA autonomous driving (5 weathers), Robosuite (12 settings), and iGibson indoor navigation (20 rooms).
  • Figure 2: Algorithm overview. SECANT training is split into two stages. Left, stage 1: the expert policy is trained by RL with weak augmentation (random cropping). Right, stage 2: the student receives ground-truth action supervision from the expert at every time step, conditioned on the same observation but with strong augmentations, such as cutout-color, Gaussian noise, Mixup, and CutMix. The student learns robust visual representations invariant to environment distractions, while maintaining high policy performance.
  • Figure 3: Ablation on different strategies to apply augmentation. "S-only" denotes a single-stage policy trained with strong augmentation, and S $\rightarrow$ W means a strongly-augmented expert imitated by a weakly-augmented student. The recipe for SECANT is W $\rightarrow$ S.
  • Figure 4: Rows 1 and 2: saliency maps of the learned policies in unseen tests. SECANT attends to the components crucial to the task, while other agents often focus on irrelevant places. Row 3: t-SNE visualization of state embeddings. Our method correctly groups semantically similar states with different visual appearances.
  • Figure 5: SECANT vs PAD inference latency. Y-axis denotes average seconds per action (log-scale). SECANT improves inference speed by an order of magnitude compared to PAD.
  • ...and 1 more figure
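
The augmentations named in the Figure 2 caption (random cropping as the weak one; cutout-color, Gaussian noise, Mixup, and CutMix as strong ones) can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's code: the function names, the default parameters (`pad`, `frac`, `sigma`, `alpha`), the pad-then-crop formulation of random cropping, and the use of a second `distractor` image for Mixup and CutMix are all assumptions. Images are taken to be float arrays of shape (H, W, C) with values in [0, 1].

```python
import numpy as np

def random_crop(img, rng, pad=4):
    """Weak augmentation: pad the borders, then crop back to the original size."""
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top, left = rng.integers(0, 2 * pad + 1, size=2)
    return padded[top:top + h, left:left + w]

def _random_box(h, w, rng, frac):
    """Pick a random box covering roughly `frac` of each image side."""
    bh, bw = max(1, int(h * frac)), max(1, int(w * frac))
    top = rng.integers(0, h - bh + 1)
    left = rng.integers(0, w - bw + 1)
    return top, left, bh, bw

def cutout_color(img, rng, frac=0.3):
    """Fill a random box with a single random color."""
    out = img.copy()
    top, left, bh, bw = _random_box(*img.shape[:2], rng, frac)
    out[top:top + bh, left:left + bw] = rng.random(img.shape[2])
    return out

def gaussian_noise(img, rng, sigma=0.1):
    """Add pixel-wise Gaussian noise and clip back to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def mixup(img, distractor, alpha=0.7):
    """Blend the observation with a distractor image."""
    return alpha * img + (1.0 - alpha) * distractor

def cutmix(img, distractor, rng, frac=0.3):
    """Paste a random box from a distractor image into the observation."""
    out = img.copy()
    top, left, bh, bw = _random_box(*img.shape[:2], rng, frac)
    out[top:top + bh, left:left + bw] = distractor[top:top + bh, left:left + bw]
    return out

rng = np.random.default_rng(0)
obs = rng.random((8, 8, 3))
distractor = rng.random((8, 8, 3))
views = [random_crop(obs, rng), cutout_color(obs, rng), gaussian_noise(obs, rng),
         mixup(obs, distractor), cutmix(obs, distractor, rng)]
```

Every function preserves the image shape, so any of them could be swapped in as the student-side augmentation without changing the network input.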