Odyssey: Empowering Minecraft Agents with Open-World Skills

Shunyu Liu, Yaoru Li, Kongcheng Zhang, Zhenyu Cui, Wenkai Fang, Yuxuan Zheng, Tongya Zheng, Mingli Song

TL;DR

Odyssey tackles the bottleneck of open-world agent development by introducing a rich, reusable open-world skill library and a planner-actor-critic framework that harnesses LLMs for long-horizon reasoning. By fine-tuning LLaMA-3 on a Minecraft-focused QA dataset to obtain the MineMA models and embedding a recursive skill-prerequisite mechanism, the approach enables efficient, compositional problem solving in Minecraft. The work also presents an agent capability benchmark with long-term planning, dynamic-immediate planning, and autonomous exploration tasks, paired with domain-specific multiple-choice question (MCQ) evaluations to quantify knowledge and reasoning. Experiments demonstrate that open-source MineMA models can match or surpass some GPT-4-based baselines while reducing costs, and ablations confirm the critical role of the skill library and planner in achieving robust open-world performance. Overall, Odyssey provides a scalable, accessible framework for advancing autonomous, generalist agents in complex environments and offers resources to foster future research across domains.
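The planner-actor-critic loop mentioned above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `run_agent`, the toy planner/actor/critic functions, and the `max_retries` parameter are all hypothetical stand-ins for LLM calls and in-game execution, and plan updating is simplified to retrying a subgoal with the critic's feedback.

```python
from typing import Callable, List, Tuple

def run_agent(
    goal: str,
    planner: Callable[[str], List[str]],
    actor: Callable[[str, str], str],
    critic: Callable[[str, str], Tuple[bool, str]],
    max_retries: int = 2,  # hypothetical knob, not taken from the paper
) -> Tuple[bool, List[Tuple[str, int, bool]]]:
    """Decompose a goal into subgoals, execute each, and retry on critic feedback."""
    log: List[Tuple[str, int, bool]] = []
    for subgoal in planner(goal):
        feedback = ""
        for attempt in range(max_retries + 1):
            result = actor(subgoal, feedback)       # act, using any prior feedback
            ok, feedback = critic(subgoal, result)  # validate and reflect
            log.append((subgoal, attempt, ok))
            if ok:
                break
        else:
            # Every attempt at this subgoal failed: stop and report failure.
            return False, log
    return True, log

# Toy stand-ins (assumptions) so the sketch runs without an LLM or a game server.
def toy_planner(goal: str) -> List[str]:
    return ["collect wood", "craft table"]

def toy_actor(subgoal: str, feedback: str) -> str:
    return f"{subgoal} (revised)" if feedback else subgoal

def toy_critic(subgoal: str, result: str) -> Tuple[bool, str]:
    if subgoal == "craft table" and "revised" not in result:
        return False, "missing planks"
    return True, ""

success, log = run_agent("build base", toy_planner, toy_actor, toy_critic)
```

In this toy run the second subgoal fails once, the critic's feedback is passed back to the actor, and the retry succeeds.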

Abstract

Recent studies have delved into constructing generalist agents for open-world environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch. Consequently, discovering diverse gameplay opportunities in the open world becomes challenging. In this work, we introduce Odyssey, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. Odyssey comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills. (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki. (3) A new agent capability benchmark that includes the long-term planning task, the dynamic-immediate planning task, and the autonomous exploration task. Extensive experiments demonstrate that the proposed Odyssey framework can effectively evaluate different capabilities of LLM-based agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.

Paper Structure

This paper contains 60 sections, 9 figures, 9 tables.

Figures (9)

  • Figure 1: An overview of the proposed Odyssey framework. Odyssey consists of three key components: (1) a fine-tuned LLaMA-3 model trained on a large-scale question-answering dataset; (2) an interactive agent equipped with an extensive open-world skill library; (3) a novel agent capability benchmark encompassing a variety of tasks.
  • Figure 2: An illustrative diagram of the interactive agent following a planner-actor-critic architecture based on the skill library. The Planner decomposes ultimate goals into specific subgoals, while the Actor sequentially executes code actions for each subgoal using the skill library. The Critic evaluates these actions through self-validation and reflection, enabling the agent to update its plan based on execution feedback.
  • Figure 3: Performance on the multi-round long-term planning task. Note that all data are from successful tasks.
  • Figure 4: Performance comparison of different models on autonomous exploration tasks. To make the results in figures clearer for readers, we adopt a 50% confidence interval to plot the error region.
  • Figure 5: An illustrative diagram of the skill recursive method for the mineDiamond task. The four colors depicted represent four different technological levels (wood, stone, iron, and diamond) following the Minecraft tech-tree.
  • ...and 4 more figures
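
The recursive skill idea described for the mineDiamond task (Figure 5) can be illustrated with a small sketch. It assumes a hypothetical prerequisite table in which each skill lists the skills that must succeed first; apart from `mineDiamond`, the skill names and the simplified wood, stone, iron, diamond chain are illustrative assumptions, not the contents of the Odyssey skill library.

```python
from typing import Dict, List, Optional, Set

# Hypothetical, simplified prerequisite table following the tech-tree levels
# (wood -> stone -> iron -> diamond). Real skills need more steps (e.g. smelting).
PREREQUISITES: Dict[str, List[str]] = {
    "mineWoodLog": [],
    "craftWoodenPickaxe": ["mineWoodLog"],
    "mineCobblestone": ["craftWoodenPickaxe"],
    "craftStonePickaxe": ["mineCobblestone"],
    "mineIronOre": ["craftStonePickaxe"],
    "craftIronPickaxe": ["mineIronOre"],
    "mineDiamond": ["craftIronPickaxe"],
}

def resolve(
    skill: str,
    done: Optional[Set[str]] = None,
    order: Optional[List[str]] = None,
) -> List[str]:
    """Return an execution order in which every prerequisite precedes its skill."""
    done = set() if done is None else done
    order = [] if order is None else order
    if skill in done:
        return order  # already scheduled, nothing to do
    for dep in PREREQUISITES[skill]:
        resolve(dep, done, order)  # recurse into unmet prerequisites first
    done.add(skill)
    order.append(skill)
    return order

plan = resolve("mineDiamond")
```

Calling `resolve("mineDiamond")` walks the chain down to the wood level and returns the skills in the order they would need to run, ending with `mineDiamond`.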