Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, Yujia Qin, Bo An, Libin Liu, Guang Shi
TL;DR
Lumine is an open recipe for building generalist agents in 3D open worlds. It grounds perception, reasoning, and action in a vision-language model trained through a three-stage curriculum (pre-training, instruction following, and reasoning). The 7B Lumine-Base model takes pixel input at $5~\mathrm{Hz}$ and outputs actions at $30~\mathrm{Hz}$, using a hybrid thinking mechanism that invokes explicit reasoning only when needed. It completes long-horizon missions in real time in Genshin Impact and generalizes zero-shot to Wuthering Waves and Honkai: Star Rail. The approach combines efficient inference strategies, memory-based context management, and curriculum-driven data curation to support language-guided control in both 3D exploration and 2D GUI tasks, including unseen environments. The reported results show gains in scalability, instruction following, and long-horizon reasoning; long-term memory, online learning, and broader-scale data remain future work. These perception and action rates, together with a $25.3\times$ latency reduction, enable smooth, responsive control in complex worlds, a step toward general-purpose decision foundation models for embodied AI.
Abstract
We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine's effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.
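The relationship between the 5 Hz perception rate and the 30 Hz action rate can be made concrete with a little arithmetic: each observed frame covers 200 ms and must be followed by 30 / 5 = 6 actions, one roughly every 33.3 ms. The Python sketch below only illustrates that timing relationship. The function names `actions_per_frame` and `schedule` are hypothetical and are not taken from the Lumine codebase, and the sketch says nothing about how the model itself generates actions.

```python
from fractions import Fraction

PERCEPTION_HZ = 5   # frames per second read by the model (rate stated in the text)
ACTION_HZ = 30      # keyboard-mouse actions per second (rate stated in the text)


def actions_per_frame(perception_hz: int, action_hz: int) -> int:
    """Number of actions emitted for each observed frame."""
    ratio = Fraction(action_hz, perception_hz)
    if ratio.denominator != 1:
        # Assumption of this sketch: the action rate is an integer
        # multiple of the perception rate.
        raise ValueError("action rate must be a multiple of perception rate")
    return ratio.numerator


def schedule(num_frames: int,
             perception_hz: int = PERCEPTION_HZ,
             action_hz: int = ACTION_HZ) -> list[tuple[int, float]]:
    """Return (frame_index, action_time_ms) pairs for the first num_frames frames."""
    k = actions_per_frame(perception_hz, action_hz)
    frame_ms = Fraction(1000, perception_hz)   # 200 ms between frames
    action_ms = Fraction(1000, action_hz)      # about 33.3 ms between actions
    return [
        (f, float(f * frame_ms + a * action_ms))
        for f in range(num_frames)
        for a in range(k)
    ]


if __name__ == "__main__":
    print(actions_per_frame(PERCEPTION_HZ, ACTION_HZ))
    for frame, t_ms in schedule(2):
        print(f"frame {frame}: action at {t_ms:.1f} ms")
```

Under these rates, any per-frame processing (including optional reasoning) has to fit a 200 ms budget on average to keep pace with the game, which is why the latency reduction matters for real-time play.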
