Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds

Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, Yujia Qin, Bo An, Libin Liu, Guang Shi

TL;DR

Lumine presents an open recipe for building generalist agents in 3D open worlds by grounding perception, reasoning, and action in a vision–language model and training it through a three-stage curriculum (pre-training, instruction following, and reasoning). The 7B Lumine-Base model processes pixel input at $5~\mathrm{Hz}$ and outputs actions at $30~\mathrm{Hz}$, using a hybrid thinking mechanism that invokes explicit reasoning only when needed. It achieves real-time, long-horizon mission completion in Genshin Impact and shows strong zero-shot generalization to other titles. The approach combines efficient inference strategies, memory-based context management, and curriculum-driven data curation to support robust, language-guided control across both 3D exploration and 2D GUI tasks, including unseen environments. The results indicate gains in scalability, instruction following, and long-horizon reasoning, with cross-game transfer to Wuthering Waves and Honkai: Star Rail, while highlighting long-term memory, online learning, and broader-scale data as areas for future work. Together, $5~\mathrm{Hz}$ perception, $30~\mathrm{Hz}$ actuation, and a $25.3\times$ latency reduction enable smooth, responsive control in complex worlds, demonstrating the feasibility of a generalist, language-grounded agent under practical real-time constraints and marking a step toward general-purpose decision foundation models for embodied AI.

Abstract

We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine's effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.
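The two rates quoted in the abstract imply a simple timing budget, which the short Python sketch below works out. Only the 5 Hz and 30 Hz figures come from the text; the assumption that actions are spread uniformly across each frame interval, and the helper names `actions_per_frame` and `frame_budget_ms`, are illustrative and not taken from the paper.

```python
from fractions import Fraction

PERCEPTION_HZ = 5   # frames per second consumed by the model (from the abstract)
ACTION_HZ = 30      # keyboard-mouse actions per second emitted (from the abstract)


def actions_per_frame(perception_hz: int, action_hz: int) -> int:
    """Number of action steps that must be emitted for each observed frame.

    Assumes (hypothetically) that the action rate is an integer multiple of
    the perception rate and that actions are evenly spaced in time.
    """
    ratio = Fraction(action_hz, perception_hz)
    if ratio.denominator != 1:
        raise ValueError("action rate is not an integer multiple of perception rate")
    return ratio.numerator


def frame_budget_ms(perception_hz: int) -> float:
    """Wall-clock time between two consecutive frames, in milliseconds."""
    return 1000.0 / perception_hz


steps = actions_per_frame(PERCEPTION_HZ, ACTION_HZ)
budget = frame_budget_ms(PERCEPTION_HZ)
per_action = budget / steps

print(f"{steps} actions per frame, {budget:.0f} ms per frame, "
      f"{per_action:.1f} ms per action")
```

Under these assumptions, each frame must be turned into six action steps within a 200 ms window, which is why inference latency matters for real-time play.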

Paper Structure

This paper contains 35 sections, 2 equations, 27 figures, 7 tables.

Figures (27)

  • Figure 1: Lumine, the first AI agent to complete hours-long missions in real time within expansive 3D open worlds.
  • Figure 2: Overview of the gameplay environment in Genshin Impact. The game combines large-scale open-world exploration and multi-level reasoning challenges within a richly interactive 3D environment. Players can freely traverse diverse regions, glide, swim, dive, and interact with characters while engaging in quests, puzzles, and combat.
  • Figure 3: Overview of the Lumine model. Built upon a VLM, Lumine receives pixel inputs along with historical context, such as previous actions and reasoning, and outputs textual keyboard and mouse actions. It employs a hybrid reasoning strategy, generating new reasoning steps only when necessary; otherwise, it directly produces actions for efficient real-time control.
  • Figure 4: Overview of Lumine’s three-stage training recipe. In the first pre-training stage, Qwen2-VL-Base is trained on large-scale image–action data to learn fundamental action primitives, resulting in the Lumine-Base model. In the second instruction-following stage, Lumine-Base is further trained on instruction–image–action triplets for language grounding, producing the Lumine-Instruct model. In the final reasoning stage, the instruction input is replaced with a thought, and an optional new thought is prepended before the action sequence, yielding the Lumine-Thinking model.
  • Figure 5: Overview of the data processing pipeline from raw gameplay recordings to curated datasets for pre-training, instruction following, and reasoning. i) Starting from 2424 hours of synchronized video-action data, we first apply rule-based filtering to produce a 1731-hour dataset for pre-training. ii) A subset of 165 hours is human-annotated for instruction-level activities, used to train a classifier that auto-labels all the raw data, further refined into 200 hours of high-quality instruction following data via GPT-4.1 captioning and action filtering. iii) Meanwhile, 15 hours of manually annotated reasoning data support the training of Lumine’s hybrid thinking. Together, this multi-stage curation pipeline enables scalable, structured curriculum learning from human demonstrations.
  • ...and 22 more figures
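
To put the dataset sizes quoted in the Figure 5 caption in proportion, the following Python sketch expresses each stage's hours as a percentage of the 2424 raw hours. The hour counts come from the caption; the dictionary labels and the `share_of_raw` helper are illustrative names, and the percentages are only a size comparison (the caption does not state that every set is a strict subset of the raw recordings).

```python
# Hours quoted in the Figure 5 caption.
RAW_HOURS = 2424
STAGE_HOURS = {
    "pre-training (after rule-based filtering)": 1731,
    "human-annotated instruction subset": 165,
    "instruction following (after captioning and filtering)": 200,
    "reasoning (manually annotated)": 15,
}


def share_of_raw(hours: float, raw_hours: float = RAW_HOURS) -> float:
    """Return `hours` as a percentage of the raw recording time."""
    return 100.0 * hours / raw_hours


shares = {name: share_of_raw(h) for name, h in STAGE_HOURS.items()}

for name, pct in shares.items():
    print(f"{name}: {STAGE_HOURS[name]} h = {pct:.1f}% of raw")
```

Rule-based filtering keeps roughly seven tenths of the raw recordings for pre-training, while the instruction-following and reasoning sets are each less than a tenth of the raw size.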