Oracle-Guided Masked Contrastive Reinforcement Learning for Visuomotor Policies

Yuhang Zhang, Jiaping Xiao, Chao Yan, Mir Feroskhan

TL;DR

OMC-RL addresses low sample efficiency and the sim-to-real gap in visuomotor policy learning by decoupling representation learning from policy optimization. In the upstream stage, masked temporal contrastive learning with a Transformer extracts temporally-aware, task-relevant features. In the downstream stage, an oracle teacher guides the agent in a learning-by-cheating setup, and this guidance is gradually reduced as training progresses. The approach yields faster convergence, stronger asymptotic performance, and robust generalization in both simulated and real-world drone navigation under perceptual disturbances. The authors present the framework as practical for real deployments and as a basis for extensions to multi-modal and instruction-guided robotics tasks.

Abstract

A prevailing approach for learning visuomotor policies is to employ reinforcement learning to map high-dimensional visual observations directly to action commands. However, the combination of high-dimensional visual inputs and agile maneuver outputs leads to long-standing challenges, including low sample efficiency and significant sim-to-real gaps. To address these issues, we propose Oracle-Guided Masked Contrastive Reinforcement Learning (OMC-RL), a novel framework designed to improve the sample efficiency and asymptotic performance of visuomotor policy learning. OMC-RL explicitly decouples the learning process into two stages: an upstream representation learning stage and a downstream policy learning stage. In the upstream stage, a masked Transformer module is trained with temporal modeling and contrastive learning to extract temporally-aware and task-relevant representations from sequential visual inputs. After training, the learned encoder is frozen and used to extract visual representations from consecutive frames, while the Transformer module is discarded. In the downstream stage, an oracle teacher policy with privileged access to global state information supervises the agent during early training to provide informative guidance and accelerate early policy learning. This guidance is gradually reduced to allow independent exploration as training progresses. Extensive experiments in simulated and real-world environments demonstrate that OMC-RL achieves superior sample efficiency and asymptotic policy performance, while also improving generalization across diverse and perceptually complex scenarios.
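
The downstream stage described above (oracle supervision that is gradually reduced) can be illustrated with a small sketch. This is not the authors' implementation: the linear decay schedule, the discrete action distributions, the KL direction, and the names `guidance_weight`, `kl_divergence`, and `guided_loss` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def guidance_weight(step, total_steps, w0=1.0):
    """Oracle-guidance weight, linearly annealed from w0 to 0.

    The linear schedule is an assumption; the abstract only says the
    guidance is "gradually reduced".
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return w0 * (1.0 - frac)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def guided_loss(rl_loss, oracle_probs, agent_probs, step, total_steps):
    """RL loss plus a decaying imitation term toward the oracle.

    KL(oracle || agent) is an assumed direction; discrete actions are
    used here only to keep the example short.
    """
    w = guidance_weight(step, total_steps)
    return rl_loss + w * kl_divergence(oracle_probs, agent_probs)


if __name__ == "__main__":
    oracle = [0.9, 0.1]
    agent = [0.5, 0.5]
    for step in (0, 50, 100):
        print(step, guided_loss(2.0, oracle, agent, step, total_steps=100))
```

Early in training the KL term dominates and pulls the agent toward the oracle's action distribution; once the weight reaches zero, only the RL objective remains and the agent explores independently.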

Paper Structure

This paper contains 23 sections, 1 theorem, 14 equations, 13 figures, 3 tables, and 1 algorithm.

Key Result

Theorem 1

The generalization error of masked contrastive learning is bounded by the discrepancy between the learned representation space $Z$ and the optimal representation space $Z^\star$, i.e.,

Figures (13)

  • Figure 1: Illustration of different learning paradigms for goal-conditioned representation and policy learning. (a) In vanilla teacher-student policy learning, the agent learns from an oracle teacher via policy distillation to better identify and reach the target through supervised imitation. (b) In vanilla contrastive learning, given an observation set $\mathcal{H}$, positive pairs (red arrows) formed between aggregated features are aligned, while all other negatives (blue arrows) are pushed apart. (c) Oracle-guided masked contrastive learning (ours) combines masked contrastive representation learning with oracle-supervised policy learning to jointly improve both feature encoding and downstream decision-making.
  • Figure 2: The framework of OMC-RL. (a) Upstream Masked Contrastive Representation Learning: A masked contrastive learning module is used to learn compact and task-relevant visual representations from sequential RGB inputs. The masked and original inputs are processed through CNN encoders and projection layers, while the masked branch is further processed by an auxiliary transformer module to compute the contrastive loss $\mathcal{L}_\text{cl}$. After training, the CNN encoder is frozen and the transformer module is discarded. (b) Downstream Oracle-Guided Policy Learning: An oracle teacher policy is first trained using privileged depth inputs and full-state information, providing expert action distributions for downstream supervision. This oracle network supervises the student policy through a learning-by-cheating strategy. Specifically, the agent policy is optimized via KL-divergence against the oracle policy distribution to enable efficient visuomotor policy learning.
  • Figure 3: Simulation environments with increasing complexity used to evaluate OMC-RL.
  • Figure 4: Training curves of episodic reward for all learning-based baselines. All results are averaged over three random seeds, with shaded regions indicating confidence intervals. Oracle serves as an upper bound with the fastest convergence and the highest asymptotic performance, while our method achieves comparable results and consistently outperforms all other baselines. NPE benefits from imitation of suboptimal demonstrations, outperforming CURL in both sample efficiency and asymptotic performance. PPO, as a vanilla baseline, exhibits the weakest performance.
  • Figure 5: Qualitative trajectory comparisons of baselines in three evaluation environments. The results demonstrate that the oracle and OMC-RL consistently generate smooth and efficient trajectories. In contrast, other baselines often produce suboptimal or erratic trajectories and tend to fail in environments with irregular layouts and complex textures.
  • ...and 8 more figures
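
The upstream stage in Figure 2(a) masks part of a frame sequence and trains with a contrastive loss $\mathcal{L}_\text{cl}$. The sketch below shows one common way to do this, an InfoNCE-style loss over per-frame feature vectors. It is a hedged illustration rather than the paper's exact loss: the zero-masking, cosine similarity, temperature value, and the helper names `mask_frames`, `cosine`, and `info_nce` are assumptions, and the CNN encoder, projection layers, and Transformer are replaced by plain feature lists.

```python
import math
import random


def mask_frames(frames, mask_ratio, rng):
    """Zero out a random subset of frame features (assumed masking scheme).

    Returns the masked copy and the sorted list of masked indices.
    """
    n = len(frames)
    k = max(1, int(round(mask_ratio * n)))
    idx = sorted(rng.sample(range(n), k))
    masked = [[0.0] * len(f) if i in idx else list(f) for i, f in enumerate(frames)]
    return masked, idx


def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss: query i is positive with key i, negative with all others."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
    masked, idx = mask_frames(frames, mask_ratio=0.5, rng=rng)
    print("masked indices:", idx)
    # In the full method, a Transformer would predict features for the masked
    # sequence; here the original frames stand in for those predictions.
    print("loss (aligned):", info_nce(frames, frames))
```

In this sketch the queries would come from the masked branch (after the auxiliary Transformer) and the keys from the unmasked branch, so minimizing the loss pushes each predicted feature toward its own frame and away from the other frames in the sequence.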

Theorems & Definitions (2)

  • Theorem 1
  • proof