Playing Non-Embedded Card-Based Games with Reinforcement Learning
Tianyang Wu, Lipeng Wan, Yuhang Wang, Qiang Wan, Xuguang Lan
TL;DR
This work tackles the challenge of building non-embedded, real-time AI agents for complex card-based RTS games such as Clash Royale, where the agent must rely on noisy visual input rather than exact game state. The authors propose an offline reinforcement learning framework that combines visual perception outputs, a generative dataset for object detection, and a transformer-based decision model (inspired by Decision Transformer and StARformer) that maps perception features to actions, enabling autonomous play on mobile devices. Key contributions include a generative, AI-assisted labeling pipeline for object detection, an evaluation of YOLOv8 variants for fast and accurate unit detection, and a delayed, continuous action prediction strategy with resampling to address data imbalance in offline datasets. The results show that the approach can defeat the built-in AI in Clash Royale while running in real time on mobile hardware, which supports the viability of non-embedded offline RL for complex, vision-driven RTS tasks and offers a foundation for further work on online RL and perception architectures.
Abstract
Significant progress has been made in AI for games, including board games, MOBA, and RTS games. However, complex agents are typically developed in an embedded manner, directly accessing game state information, whereas human players rely on noisy visual data; this leads to unfair competition. Developing complex non-embedded agents remains challenging, especially in card-based RTS games with complex features and large state spaces. We propose a non-embedded offline reinforcement learning training strategy using visual inputs to achieve real-time autonomous gameplay in the RTS game Clash Royale. Due to the lack of an object detection dataset for this game, we designed an efficient generative object detection dataset for training. We extract features using state-of-the-art object detection and optical character recognition models. Our method enables real-time image acquisition, perception feature fusion, decision-making, and control on mobile devices, successfully defeating built-in AI opponents. All code is open-sourced at https://github.com/wty-yy/katacr.
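
The TL;DR mentions a delayed, continuous action prediction strategy with resampling for imbalanced offline data. The sketch below is one possible reading of that idea, not the authors' implementation: it assumes card placements are sparse events on a frame timeline, turns them into a per-frame "frames until the next action" target capped at a maximum delay, and then samples frames with weights that favor those close to an action. The function names (`delay_targets`, `sampling_weights`, `resample`), the cap `max_delay`, and the linear weighting are all hypothetical choices made for illustration.

```python
import random


def delay_targets(action_frames, num_frames, max_delay=20):
    """Return, for each frame, the number of frames until the next action.

    Frames with no upcoming action, or whose next action is further away
    than max_delay, get the capped value max_delay (an assumed convention).
    """
    action_set = set(action_frames)
    targets = [max_delay] * num_frames
    next_action = None
    # Walk backwards so the nearest upcoming action is always known.
    for t in range(num_frames - 1, -1, -1):
        if t in action_set:
            next_action = t
        if next_action is not None:
            targets[t] = min(next_action - t, max_delay)
    return targets


def sampling_weights(targets, max_delay=20):
    """Assumed linear weighting: frames nearer to an action weigh more."""
    return [1.0 + (max_delay - d) / max_delay for d in targets]


def resample(num_samples, weights, seed=0):
    """Draw frame indices with replacement, proportionally to weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


if __name__ == "__main__":
    targets = delay_targets(action_frames=[3, 7], num_frames=10, max_delay=3)
    weights = sampling_weights(targets, max_delay=3)
    print(targets)
    print(resample(5, weights))
```

Under these assumptions the regression target is dense (every frame carries a delay value) even though actual actions are rare, and the weighted sampling shifts training batches toward frames near an action.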
