MORE-3S: Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces
Tianyu Zheng, Ge Zhang, Xingwei Qu, Ming Kuang, Stephen W. Huang, Zhaofeng He
TL;DR
MORE-3S introduces a multimodal offline RL framework that grounds states in image-derived embeddings and actions in textual prompts, aligning them within a shared semantic space to enable supervised sequence modeling. By coupling a fixed multimodal encoder (e.g., LXMERT) with a GPT-style sequence model and a memory-augmented attention mechanism, it predicts future trajectory components while conditioning on returns-to-go (RTG), improving long-horizon planning. The approach yields strong performance on Atari and OpenAI Gym benchmarks, with thorough ablations demonstrating the value of RTG conditioning, long-term memory, and pretrained components, and showing robustness to variations in the action prompts. This work suggests a practical pathway for leveraging pretrained language and multimodal models to enhance offline RL efficiency and planning capabilities in diverse environments.
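As a point of reference for the returns-to-go conditioning mentioned above, the following is a minimal sketch of how a return-to-go sequence is typically computed from a logged reward sequence. The function name `returns_to_go`, the `gamma` parameter, and the default of an undiscounted sum are assumptions made for illustration; the summary above does not specify the exact formulation used in MORE-3S.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return, for each timestep t, the (optionally discounted) sum of rewards from t onward.

    Hypothetical helper: gamma=1.0 gives the plain undiscounted sum, which is an
    assumption here and not a detail confirmed by the source.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each step reuses the suffix sum after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


if __name__ == "__main__":
    # Toy reward sequence, not data from the paper.
    print(returns_to_go([1.0, 0.0, 2.0]))
```

In a return-conditioned sequence model, each value in this list would be paired with the state and action at the same timestep, so that the model learns which actions are associated with a given remaining return.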
Abstract
Drawing upon the intuition that aligning different modalities to the same semantic embedding space would allow models to understand states and actions more easily, we propose a new perspective on the offline reinforcement learning (RL) challenge. More concretely, we transform it into a supervised learning task by integrating multimodal and pre-trained language models. Our approach incorporates state information derived from images and action-related data obtained from text, thereby bolstering RL training performance and promoting long-term strategic thinking. We emphasize the contextual understanding of language and demonstrate how decision-making in RL can benefit from aligning the representations of states and actions with the representations of language. Our method significantly outperforms current baselines, as evidenced by evaluations conducted on Atari and OpenAI Gym environments. This contributes to advancing offline RL performance and efficiency while providing a novel perspective on offline RL. Our code and data are available at https://github.com/Zheng0428/MORE_.
