Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang, Junpeng Yue, Hao Luo, Ziluo Ding, Zongqing Lu

TL;DR

The paper tackles the difficulty of providing informative rewards for open-ended reinforcement learning in Minecraft by building a high-quality YouTube-based dataset and a cross-modal, RL-friendly vision-language model. It introduces CLIP4MC, an enhanced MineCLIP-inspired VLM whose training includes a novel local-entity-size signal and a swap-based contrastive scheme to reflect task completion, producing richer intrinsic rewards. Empirical results on MineDojo tasks show improved RL performance, especially on challenging hunt tasks, and demonstrate the importance of both dataset quality (via correlation filtering) and objective design. The work provides open-source data and methods to scale RL with internet-scale multimodal knowledge for embodied agents.

Abstract

One of the essential missions in the AI research community is to build an autonomous embodied agent that can achieve high-level performance across a wide spectrum of tasks. However, acquiring or manually designing rewards for all open-ended tasks is unrealistic. In this paper, we propose a novel cross-modal contrastive learning framework, CLIP4MC, aiming to learn a reinforcement learning (RL) friendly vision-language model (VLM) that serves as an intrinsic reward function for open-ended tasks. Simply utilizing the similarity between the video snippet and the language prompt is not RL-friendly, since standard VLMs may only capture the similarity at a coarse level. To achieve RL-friendliness, we incorporate the task completion degree into the VLM training objective, as this information can help agents distinguish the importance of different states. Moreover, we provide clean YouTube datasets based on the large-scale YouTube database provided by MineDojo. Specifically, two rounds of filtering operations ensure that the dataset covers enough essential information and that each video-text pair is highly correlated. Empirically, we demonstrate that the proposed method achieves better performance on RL tasks compared with baselines. The code and datasets are available at https://github.com/PKU-RL/CLIP4MC.
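
To make the idea of a VLM-based intrinsic reward concrete, the following minimal sketch scores a state by the cosine similarity between a video-snippet embedding and a language-prompt embedding. It is an illustration only, not the CLIP4MC implementation: the function names, the `baseline` parameter, and the clipping at zero are hypothetical, and the embeddings are placeholder vectors standing in for the outputs of the VLM's video and text encoders.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector carries no direction, so treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def intrinsic_reward(
    video_emb: Sequence[float],
    text_emb: Sequence[float],
    baseline: float = 0.0,
) -> float:
    """Hypothetical reward: similarity above a baseline, clipped at zero.

    The baseline and the clipping are illustrative assumptions, not the
    formula used in the paper.
    """
    return max(cosine_similarity(video_emb, text_emb) - baseline, 0.0)


if __name__ == "__main__":
    # Placeholder embeddings; in practice these would come from the
    # video encoder (recent frames) and the text encoder (task prompt).
    prompt = [0.6, 0.8, 0.0]
    close_frames = [0.6, 0.8, 0.0]
    unrelated_frames = [0.0, 0.0, 1.0]
    print(intrinsic_reward(close_frames, prompt))
    print(intrinsic_reward(unrelated_frames, prompt))
```

A reward of this form only reflects how well the snippet matches the prompt overall, which is the coarse-level limitation the abstract describes; the paper addresses it by incorporating the task completion degree into the training objective.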

Paper Structure

This paper contains 34 sections, 5 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Illustration of the YouTube video database. The screenshots of video clips are on the left and key entities are circled in red. The corresponding transcript clips are on the right and key entities are marked in red. We give examples of irrelevant, mismatched, and matched video content in the YouTube video database.
  • Figure 2: Examples of how we estimate the size of key entities in video frames. Red bounding boxes are generated by our modified MineCLIP visual encoder, following the approach proposed in MaskCLIP (Zhou et al., 2022). These boxes are then used to calculate the size.
  • Figure 3: Illustration of CLIP4MC training. The upper part shows the concept of contrastive learning, while the lower part explains the swapping operation.
  • Figure 3: Results of video-to-text / text-to-video retrieval on the test set. The best results are highlighted in bold.
  • Figure 4: Scatter plots illustrating the relationship between the entity size and the intrinsic reward. The red line indicates a linear fit to the data.
  • ...and 5 more figures