MOOSS: Mask-Enhanced Temporal Contrastive Learning for Smooth State Evolution in Visual Reinforcement Learning
Jiarui Sun, M. Ugur Akcal, Wei Zhang, Girish Chowdhary
TL;DR
MOOSS addresses the challenge of sample efficiency in visual reinforcement learning by explicitly modeling state evolution through a graph-based spatial-temporal masking regime and a multi-level temporal contrastive objective. It employs dual encoders (one momentum) and a Transformer-based predictive decoder to generate query states from masked inputs, while a temporal contrastive loss enforces smooth, hierarchical state similarities across time. The approach yields significant improvements on DMControl and Atari benchmarks, outperforming strong baselines and providing thorough ablations that highlight the benefits of both masking and multi-level contrastive learning. The method advances state representation learning in visual RL and offers practical gains in sample efficiency with open-source code available for reproducibility and further exploration.
Abstract
In visual Reinforcement Learning (RL), learning from pixel-based observations poses significant challenges on sample efficiency, primarily due to the complexity of extracting informative state representations from high-dimensional data. Previous methods such as contrastive-based approaches have made strides in improving sample efficiency but fall short in modeling the nuanced evolution of states. To address this, we introduce MOOSS, a novel framework that leverages a temporal contrastive objective with the help of graph-based spatial-temporal masking to explicitly model state evolution in visual RL. Specifically, we propose a self-supervised dual-component strategy that integrates (1) a graph construction of pixel-based observations for spatial-temporal masking, coupled with (2) a multi-level contrastive learning mechanism that enriches state representations by emphasizing temporal continuity and change of states. MOOSS advances the understanding of state dynamics by disrupting and learning from spatial-temporal correlations, which facilitates policy learning. Our comprehensive evaluation on multiple continuous and discrete control benchmarks shows that MOOSS outperforms previous state-of-the-art visual RL methods in terms of sample efficiency, demonstrating the effectiveness of our method. Our code is released at https://github.com/jsun57/MOOSS.
