Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents

Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, Wanjun Zhong, Zili Li, Yu Wang, Yu Miao, Bo Zhou, Yuanfan Li, Hao Wang, Zhongkai Zhao, Faming Wu, Zhengxuan Jiang, Weihao Tan, Heyuan Yao, Shi Yan, Xiangyang Li, Yitao Liang, Yujia Qin, Guang Shi

TL;DR

This work tackles building a scalable generalist agent that operates across diverse games and computer-use tasks by grounding actions in a unified, native keyboard-mouse interface. It introduces Game-TARS, which combines a scalable Human-Native Interaction action space, sparse think-aloud pretraining, and a decaying loss that reduces causal confusion, followed by targeted post-training with instruction following, multimodal prompts, and cross-domain data. The approach yields strong cross-domain generalization: roughly twice the success rate of the previous state of the art in open-world Minecraft, near-human performance on unseen web 3D games, and better results on FPS benchmarks than large LLM/VLM baselines. These results support a scalable path toward generalist computer-use agents through simple action representations and massive multimodal pretraining.

Abstract

We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including OS, web, and simulation games. Game-TARS is pre-trained on over 500B tokens of diverse trajectories and multimodal data. Key techniques include a decaying continual loss to reduce causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth and inference cost. Experiments show that Game-TARS achieves about twice the success rate of the previous SOTA model on open-world Minecraft tasks, comes close to the generality of fresh humans in unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet on FPS benchmarks. Training-time and test-time scaling results confirm that the unified action space sustains improvements when scaled to cross-game and multimodal data. Our results demonstrate that simple, scalable action representations combined with large-scale pre-training provide a promising path toward generalist agents with broad computer-use abilities.
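
The abstract names a decaying loss but does not give its form. As a purely illustrative sketch, one way such a loss could work is to down-weight steps that repeat the previous keyboard-mouse action, so that long runs of identical actions do not dominate training. The exponential schedule, the `gamma` parameter, and the function names `decayed_weights` and `weighted_nll` are assumptions made for this example, not details taken from the paper.

```python
from typing import Hashable, List, Sequence


def decayed_weights(actions: Sequence[Hashable], gamma: float = 0.5) -> List[float]:
    """Per-step loss weights (hypothetical schedule, not from the paper).

    A step whose action differs from the previous step gets weight 1.0.
    The k-th consecutive repeat of the same action gets weight gamma**k.
    """
    weights: List[float] = []
    run = 0  # number of consecutive repeats seen so far
    for i, action in enumerate(actions):
        if i > 0 and action == actions[i - 1]:
            run += 1
        else:
            run = 0
        weights.append(gamma ** run)
    return weights


def weighted_nll(nll: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-step negative log-likelihoods."""
    if len(nll) != len(weights) or not weights:
        raise ValueError("nll and weights must be non-empty and equally long")
    return sum(n * w for n, w in zip(nll, weights)) / sum(weights)


if __name__ == "__main__":
    trajectory = ["W", "W", "W", "click", "W"]
    w = decayed_weights(trajectory, gamma=0.5)
    print(w)
    print(weighted_nll([0.1, 0.1, 0.1, 2.0, 1.0], w))
```

Under this assumed schedule, the held-down `W` key contributes progressively less to the loss, while the action changes (`click`, then the return to `W`) keep full weight.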

Paper Structure

This paper contains 48 sections, 9 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Game-TARS achieves a higher level of performance compared to humans, domain experts, and general VLMs in unseen 3D virtual environments, including open-world MineDojo, FPS games in ViZDoom, web games, and MiniWorld simulators.
  • Figure 2: Generalist Game Agent Game-TARS. Game-TARS can interpret and respond to various human instructions across diverse environments using a single neural network with a consistent set of weights. It was pre-trained on a wide range of multimodal datasets, including vision-language question-answering, captioning, over 20k hours of game trajectories, GUI agent trajectories, and more.
  • Figure 3: The pipeline of Think-Aloud data collection and post-processing. This process captures and synchronizes three types of original inputs (screen, keyboard and mouse, and audio), refines sparse thinking through the ASR-LLM pipeline, and uses a timestamp aligner to synthesize the final (Instruction, Observation, Thinking, Action) datasets.
  • Figure 4: Distribution of different game types in the Game-TARS training dataset.
  • Figure 5: Game-TARS is trained on a wide range of games, including adventure, shooting, role-playing, and racing.
  • ...and 10 more figures