MORE-3S: Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces

Tianyu Zheng; Ge Zhang; Xingwei Qu; Ming Kuang; Stephen W. Huang; Zhaofeng He

TL;DR

MORE-3S introduces a multimodal offline RL framework that grounds states in image-derived embeddings and actions in textual prompts, aligning them within a shared semantic space to enable supervised sequence modeling. By coupling a fixed multimodal encoder (e.g., LXMERT) with a GPT-style sequence model and a memory-augmented attention mechanism, it predicts future trajectory components while conditioning on returns-to-go, improving long-horizon planning. The approach yields strong performance on Atari and OpenAI Gym benchmarks, with thorough ablations demonstrating the value of RTG conditioning, long-term memory, and pretrained components, while revealing robustness to action prompt variations. This work suggests a practical pathway for leveraging pretrained language and multimodal models to enhance offline RL efficiency and planning capabilities in diverse environments.
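The return-to-go (RTG) conditioning mentioned above can be illustrated with a minimal sketch. It assumes the common undiscounted definition, where the RTG at step t is the sum of rewards from t to the end of the trajectory; the summary does not say whether MORE-3S applies discounting or scaling. The function name `returns_to_go` and the sample rewards are hypothetical, chosen only for illustration.

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """Undiscounted return-to-go: rtg[t] = rewards[t] + rewards[t+1] + ...

    Assumption: no discount factor, which is one common convention in
    return-conditioned sequence modeling, not necessarily the paper's choice.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each step reuses the suffix sum.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


# Hypothetical per-step rewards from one offline trajectory.
rewards = [1.0, 0.0, 2.0, 1.0]
print(returns_to_go(rewards))  # [4.0, 3.0, 3.0, 1.0]
```

In a return-conditioned setup, each RTG value would be fed to the sequence model alongside the state and action at that step, so the model can be prompted with a target return at evaluation time.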

Abstract

Drawing upon the intuition that aligning different modalities to the same semantic embedding space would allow models to understand states and actions more easily, we propose a new perspective on the offline reinforcement learning (RL) challenge. More concretely, we transform it into a supervised learning task by integrating multimodal and pre-trained language models. Our approach incorporates state information derived from images and action-related data obtained from text, thereby bolstering RL training performance and promoting long-term strategic thinking. We emphasize the contextual understanding of language and demonstrate how decision-making in RL can benefit from aligning the representations of states and actions with the representation of language. Our method significantly outperforms current baselines, as evidenced by evaluations conducted on Atari and OpenAI Gym environments. This contributes to advancing offline RL performance and efficiency while providing a novel perspective on offline RL. Our code and data are available at https://github.com/Zheng0428/MORE_.

Paper Structure (19 sections, 7 equations, 3 figures, 11 tables)

This paper contains 19 sections, 7 equations, 3 figures, 11 tables.

Introduction
Related Work
Preliminary
Architecture and Training
Model Architecture
Memory Mechanism
Training Details
Procedure of Training
Experiments
Experimental Setup
Baselines
Atari
OpenAI Gym
Discussion
Conclusion
...and 4 more sections

Figures (3)

Figure 1: Architecture diagram of the proposed MORE-3S approach. The Multimodal Encoder component combines the action (text) and state (image) inputs using the LXMERT model. "Embed." denotes the embedding process. Autoregressive modeling of trajectories captures the system's dynamics by modeling trajectories as a sequence of tuples. LPMs predict subsequent actions based on the encoded sequence $O_t$, which corresponds to the 'Mixed Embed.' section in the diagram.
Figure 2: Schematic representation of the integration of the return-to-go (RTG) quantity and the memory mechanism in the GPT-style attention architecture.
Figure 3: Experiment on randomizing model weights versus finetuning them on OpenAI Gym.
