A Hybrid Co-Finetuning Approach for Visual Bug Detection in Video Games
Faliu Yi, Sherif Abdelfattah, Wei Huang, Adrian Brown
TL;DR
The paper tackles visual bug detection in video games under data scarcity, focusing on multi-frame bugs that require temporal context. It introduces Hybrid Co-Finetuning (CFT), which combines two components: co-supervised learning (CSL) on labeled data from the target title and co-domain titles, and a self-supervised learning (SSL) latent reconstruction objective that includes target distillation from pretrained vision encoders. Key contributions include the detection loss $L_{od}$, the co-supervised loss $L_{co\_sup}$, which fuses in an alpha-weighted co-title signal, and the latent SSL loss $L_{CFT}$, which uses a fixed masking ratio. Together these improve data efficiency and robustness across three game environments. Empirically, CFT outperforms Azure AutoML baselines on mAP and F1, even when trained with only 50% of the target annotations. Ablations confirm that both the CSL and SSL components matter and that ViT backbones and MAE/DINOv1 SSL targets are effective.
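To make the two loss components concrete, the following is a minimal sketch, not the authors' implementation. This summary does not give the exact formulas, so the sketch assumes that the co-supervised loss is the target-title loss plus alpha times the co-title loss, and that the latent reconstruction loss is a mean squared error computed only on masked tokens. The function names `co_supervised_loss`, `sample_mask`, and `masked_latent_loss` are hypothetical and chosen for illustration.

```python
import random


def co_supervised_loss(target_loss, co_loss, alpha):
    """Assumed form: target-title loss plus alpha times the co-title loss."""
    return target_loss + alpha * co_loss


def sample_mask(num_tokens, mask_ratio, rng):
    """Mark a fixed fraction of token positions as masked (True = masked)."""
    num_masked = int(round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def masked_latent_loss(predicted, target, mask):
    """Assumed form: per-token MSE between latents, averaged over masked tokens."""
    total, count = 0.0, 0
    for pred_vec, tgt_vec, is_masked in zip(predicted, target, mask):
        if is_masked:
            squared_error = sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec))
            total += squared_error / len(tgt_vec)
            count += 1
    # With no masked tokens there is nothing to reconstruct.
    return total / count if count else 0.0
```

In this sketch, `target` stands in for latent vectors produced by a frozen pretrained encoder, and `predicted` for the student model's outputs at the same token positions. Setting `alpha` to 0 reduces the co-supervised loss to the target-title loss alone.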
Abstract
Manual identification of visual bugs in video games is a resource-intensive and costly process, often demanding specialized domain knowledge. While supervised visual bug detection models offer a promising solution, their reliance on extensive labeled datasets presents a significant challenge due to the infrequent occurrence of such bugs. To overcome this limitation, we propose a hybrid Co-FineTuning (CFT) method that effectively integrates both labeled and unlabeled data. Our approach leverages labeled samples from the target game and diverse co-domain games, additionally incorporating unlabeled data to enhance feature representation learning. This strategy maximizes the utility of all available data, substantially reducing the dependency on labeled examples from the specific target game. The developed framework demonstrates enhanced scalability and adaptability, facilitating efficient visual bug detection across various game titles. Our experimental results show the robustness of the proposed method for game visual bug detection, exhibiting superior performance compared to conventional baselines across multiple gaming environments. Furthermore, CFT maintains competitive performance even when trained with only 50% of the labeled data from the target game.
