Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction

Yonggang Jin, Ge Zhang, Hao Zhao, Tianyu Zheng, Jarvi Guo, Liuyu Xiang, Shawn Yue, Stephen W. Huang, Zhaofeng He, Jie Fu

TL;DR

This work explores enhanced forms of task guidance that let agents comprehend gameplay instructions, giving them a "read-to-play" capability. It shows that incorporating multimodal game instructions significantly enhances the decision transformer's multitasking and generalization capabilities.

Abstract

Developing a generalist agent is a longstanding objective in artificial intelligence. Previous efforts utilizing extensive offline datasets from various tasks demonstrate remarkable performance in multitasking scenarios within Reinforcement Learning. However, these works encounter challenges in extending their capabilities to new tasks. Recent approaches integrate textual guidance or visual trajectory into decision networks to provide task-specific contextual cues, representing a promising direction. However, it is observed that relying solely on textual guidance or visual trajectory is insufficient for accurately conveying the contextual information of tasks. This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions, thereby facilitating a "read-to-play" capability. Drawing inspiration from the success of multimodal instruction tuning in visual tasks, we treat the visual-based RL task as a long-horizon vision task and construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer. Experimental results demonstrate that incorporating multimodal game instructions significantly enhances the decision transformer's multitasking and generalization capabilities.

Paper Structure (28 sections, 15 equations, 7 figures, 12 tables)


Figures (7)

  • Figure 1: Imagine an agent learning to play Palworld (a Pokémon-like game). (1) The agent exhibits confusion when only relying on textual guidance. (2) The agent is confused when presented with images of a Pal sphere and a Pal. (3) The agent understands how to catch a pet through multimodal guidance, which combines textual guidance with images of the Pal sphere and Pal.
  • Figure 2: An illustrative example of game instructions. Each instruction consists of three sections: game description, game trajectory, and game guidance (including action, language guidance, and the position of key elements).
  • Figure 3: Model architecture of Decision Transformer with Game Instruction (DTGI). First, we represent the multimodal instructions (Section \ref{instruction}). Second, we calculate an importance score for each instruction in the instruction set (Section \ref{importance}). Finally, we propose a novel design named SHyperGenerator to integrate game instructions into DT: the n instructions generate n sets of module parameters through hypernetworks, and these module parameters are weighted by the importance scores of their instructions and then used as adapter parameters (Section \ref{dt}).
  • Figure 4: Performance comparison of DT and our model under different dataset sizes.
  • Figure 5: Visualization of instruction importance scores for 10 training games and 4 unseen games. An in-depth analysis of the 28th training game reveals a correlation between higher scores and increased trajectory diversity.
  • ...and 2 more figures
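
The Figure 3 caption describes n instructions producing n sets of module parameters through hypernetworks, which are then weighted by importance score and used as adapter parameters. The following is a minimal sketch of that idea only, not the paper's SHyperGenerator implementation. Everything concrete here is an assumption made for illustration: the single linear hypernetwork, the softmax normalization of the importance scores, the bottleneck adapter with a ReLU and a residual connection, and all function names and dimensions.

```python
import math
import random


def softmax(scores):
    """Normalize raw importance scores into weights that sum to 1 (assumed normalization)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(matrix, vec):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_hypernetwork(embed_dim, hidden_dim, bottleneck_dim, seed=0):
    """Hypothetical linear hypernetwork: instruction embedding -> flat adapter parameters."""
    rng = random.Random(seed)
    n_params = 2 * hidden_dim * bottleneck_dim  # down-projection + up-projection
    return [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)] for _ in range(n_params)]


def generate_adapter(hyper, instruction_embeddings, importance_scores,
                     hidden_dim, bottleneck_dim):
    """Generate one parameter set per instruction, then blend them by importance."""
    weights = softmax(importance_scores)
    per_instruction = [linear(hyper, emb) for emb in instruction_embeddings]
    flat = [
        sum(w * params[i] for w, params in zip(weights, per_instruction))
        for i in range(len(hyper))
    ]
    split = hidden_dim * bottleneck_dim
    # down: bottleneck_dim rows of length hidden_dim
    down = [flat[r * hidden_dim:(r + 1) * hidden_dim] for r in range(bottleneck_dim)]
    # up: hidden_dim rows of length bottleneck_dim
    up = [
        flat[split + r * bottleneck_dim: split + (r + 1) * bottleneck_dim]
        for r in range(hidden_dim)
    ]
    return down, up


def apply_adapter(hidden_state, down, up):
    """Bottleneck adapter with a residual connection: h + up(relu(down(h)))."""
    squeezed = [max(0.0, v) for v in linear(down, hidden_state)]
    delta = linear(up, squeezed)
    return [h + d for h, d in zip(hidden_state, delta)]


if __name__ == "__main__":
    hyper = make_hypernetwork(embed_dim=4, hidden_dim=3, bottleneck_dim=2)
    embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # two toy instructions
    down, up = generate_adapter(hyper, embeddings, [0.3, 1.2], 3, 2)
    print(apply_adapter([1.0, 2.0, 3.0], down, up))
```

In this sketch, an instruction with a much larger importance score dominates the blended adapter parameters, which is one plausible reading of "weighted based on the importance score" in the caption.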