MORE-3S: Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces

Tianyu Zheng, Ge Zhang, Xingwei Qu, Ming Kuang, Stephen W. Huang, Zhaofeng He

TL;DR

MORE-3S introduces a multimodal offline RL framework that grounds states in image-derived embeddings and actions in textual prompts, aligning them within a shared semantic space to enable supervised sequence modeling. By coupling a fixed multimodal encoder (e.g., LXMERT) with a GPT-style sequence model and a memory-augmented attention mechanism, it predicts future trajectory components while conditioning on returns-to-go, improving long-horizon planning. The approach yields strong performance on Atari and OpenAI Gym benchmarks, with thorough ablations demonstrating the value of RTG conditioning, long-term memory, and pretrained components, while revealing robustness to action prompt variations. This work suggests a practical pathway for leveraging pretrained language and multimodal models to enhance offline RL efficiency and planning capabilities in diverse environments.
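The returns-to-go (RTG) conditioning mentioned above can be illustrated with a minimal sketch. It assumes the definition commonly used in return-conditioned sequence models, where the RTG at step t is the sum of rewards from t to the end of the episode. The function name `returns_to_go` and the optional discount `gamma` are illustrative assumptions, not taken from the paper or its code.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return the (optionally discounted) sum of future rewards at each step.

    Assumption: gamma=1.0 (undiscounted) is used as the default here;
    the paper's exact setting is not stated in this summary.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the total accumulated after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


if __name__ == "__main__":
    # Rewards of a hypothetical four-step episode.
    print(returns_to_go([1.0, 0.0, 2.0, 1.0]))  # [4.0, 3.0, 3.0, 1.0]
```

In an offline dataset these values would be computed once per logged trajectory and fed to the sequence model alongside the state and action embeddings, so that at evaluation time a target return can be supplied as the conditioning signal.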

Abstract

Drawing upon the intuition that aligning different modalities to the same semantic embedding space would allow models to understand states and actions more easily, we propose a new perspective on the offline reinforcement learning (RL) challenge. More concretely, we transform it into a supervised learning task by integrating multimodal and pre-trained language models. Our approach incorporates state information derived from images and action-related data obtained from text, thereby bolstering RL training performance and promoting long-term strategic thinking. We emphasize the contextual understanding of language and demonstrate how decision-making in RL can benefit from aligning state and action representations with language representations. Our method significantly outperforms current baselines, as evidenced by evaluations conducted on Atari and OpenAI Gym environments. This contributes to advancing offline RL performance and efficiency while providing a novel perspective on offline RL. Our code and data are available at https://github.com/Zheng0428/MORE_.

Paper Structure (19 sections, 7 equations, 3 figures, 11 tables)

This paper contains 19 sections, 7 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Architecture diagram of the proposed MORE-3S approach. The Multimodal Encoder component combines the action (text) and state (image) inputs using the LXMERT model. "Embed." denotes the embedding process. Autoregressive modeling of trajectories captures the system's dynamics by modeling trajectories as a sequence of tuples. LPMs predict subsequent actions based on the encoded sequence $O_t$, which corresponds to the 'Mixed Embed.' section in the diagram.
  • Figure 2: Schematic Representation of the Integration of Return-to-Go (RTG) Quantity and Memory Mechanism in GPT-style Attention Architecture.
  • Figure 3: Experiment on randomizing model weights versus finetuning them on OpenAI Gym.