Language Reward Modulation for Pretraining Reinforcement Learning
Ademi Adeniji, Amber Xie, Carmelo Sferrazza, Younggyo Seo, Stephen James, Pieter Abbeel
TL;DR
The paper addresses the reward design bottleneck in reinforcement learning by proposing LAMP, a pretraining framework that uses a frozen vision-language model (VLM) to generate language-conditioned intrinsic rewards for RL pretraining in a diverse, visually rich environment. By combining R3M-based rewards with the Plan2Explore exploration objective, LAMP biases exploration toward semantically meaningful affordances and yields a language-conditioned policy that can be finetuned efficiently on downstream tasks. Across RLBench tasks, LAMP improves sample efficiency and is robust to different language prompts and VLMs, highlighting the potential of VLMs as a scalable source of pretraining signal for robotics. Limitations include the dependence on VLM inference speed and the need to extend the approach to longer-horizon sequences; future work may explore faster VLMs and language-conditioned pretraining over longer horizons.
Abstract
Using learned reward functions (LRFs) to solve sparse-reward reinforcement learning (RL) tasks has yielded steady progress in task complexity over the years. In this work, we question whether today's LRFs are best suited as a direct replacement for task rewards. Instead, we propose leveraging the capabilities of LRFs as a pretraining signal for RL. Concretely, we propose $\textbf{LA}$nguage Reward $\textbf{M}$odulated $\textbf{P}$retraining (LAMP), which leverages the zero-shot capabilities of Vision-Language Models (VLMs) as a $\textit{pretraining}$ utility for RL rather than as a downstream task reward. LAMP uses a frozen, pretrained VLM to scalably generate noisy, albeit shaped, exploration rewards by computing the contrastive alignment between a highly diverse collection of language instructions and the image observations of an agent in its pretraining environment. Using reinforcement learning, LAMP optimizes these rewards together with standard novelty-seeking exploration rewards to acquire a language-conditioned, pretrained policy. Our VLM pretraining approach, which departs from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks in RLBench.
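The reward combination described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the image and instruction embeddings have already been produced by some frozen VLM, uses plain cosine similarity as a stand-in for the contrastive alignment score, uses ensemble disagreement as a stand-in for the novelty-seeking reward, and mixes the two with a hypothetical weight `alpha`. All function and parameter names are illustrative assumptions.

```python
import numpy as np


def alignment_reward(image_emb, text_emb, eps=1e-8):
    """Cosine similarity between one image embedding and one instruction embedding.

    Assumption: both embeddings come from a frozen VLM and share a dimension.
    """
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    denom = np.linalg.norm(image_emb) * np.linalg.norm(text_emb) + eps
    return float(image_emb @ text_emb / denom)


def novelty_reward(ensemble_predictions):
    """Disagreement of an ensemble of predictors.

    Input shape: (num_members, num_features). The reward is the variance
    across members, averaged over features (a stand-in for a
    novelty-seeking exploration bonus).
    """
    preds = np.asarray(ensemble_predictions, dtype=np.float64)
    return float(preds.var(axis=0).mean())


def pretraining_reward(image_emb, text_emb, ensemble_predictions, alpha=0.5):
    """Weighted sum of the language-alignment and novelty rewards.

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    r_lang = alignment_reward(image_emb, text_emb)
    r_novel = novelty_reward(ensemble_predictions)
    return alpha * r_lang + (1.0 - alpha) * r_novel


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=16)          # placeholder for a VLM image embedding
    text_emb = rng.normal(size=16)           # placeholder for a VLM text embedding
    ensemble = rng.normal(size=(5, 32))      # placeholder ensemble predictions
    print(pretraining_reward(image_emb, text_emb, ensemble, alpha=0.5))
```

During pretraining, a new instruction would be sampled from the diverse instruction set and the policy conditioned on its embedding; downstream finetuning would then replace this combined reward with the task reward.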
