A Hybrid Co-Finetuning Approach for Visual Bug Detection in Video Games

Faliu Yi, Sherif Abdelfattah, Wei Huang, Adrian Brown

TL;DR

The paper tackles visual bug detection in video games under data scarcity, focusing on multi-frame bugs that require temporal context. It introduces Hybrid Co-Finetuning (CFT), which combines two components: co-supervised learning (CSL) across target and co-domain titles, and a self-supervised learning (SSL) objective that reconstructs masked inputs in latent space using targets distilled from pretrained vision encoders. Key contributions include the detection loss $L_{od}$, the co-supervised loss $L_{co\_sup}$ that fuses in an alpha-weighted co-title signal, and the latent SSL loss $L_{CFT}$ with a fixed masking ratio. Together these yield better data efficiency and robustness across three game environments. Empirically, CFT outperforms Azure AutoML baselines on mAP and F1, even when trained with only 50% of the target annotations. Ablations confirm that both the CSL and SSL components matter and that ViT backbones and MAE/DINOv1 SSL targets are effective.

Abstract

Manual identification of visual bugs in video games is a resource-intensive and costly process, often demanding specialized domain knowledge. While supervised visual bug detection models offer a promising solution, their reliance on extensive labeled datasets presents a significant challenge due to the infrequent occurrence of such bugs. To overcome this limitation, we propose a hybrid Co-FineTuning (CFT) method that effectively integrates both labeled and unlabeled data. Our approach leverages labeled samples from the target game and diverse co-domain games, additionally incorporating unlabeled data to enhance feature representation learning. This strategy maximizes the utility of all available data, substantially reducing the dependency on labeled examples from the specific target game. The developed framework demonstrates enhanced scalability and adaptability, facilitating efficient visual bug detection across various game titles. Our experimental results show the robustness of the proposed method for game visual bug detection, exhibiting superior performance compared to conventional baselines across multiple gaming environments. Furthermore, CFT maintains competitive performance even when trained with only 50% of the labeled data from the target game.

Paper Structure

This paper contains 17 sections, 3 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: A typical workflow for visual bug detection in video games.
  • Figure 2: A block diagram for the Co-finetuning method design. It involves two main objectives: a self-supervised target distillation objective and a co-supervised object detection objective.
  • Figure 3: Illustration of our proposed self-supervised objective. We follow a latent masked autoencoder with a target distillation criterion: an input frame is patchified and masked, the original version is fed into a target encoder and the masked version into our student encoder, and the reconstruction error is computed in the latent space, mainly over the masked patches.
  • Figure 4: Representative frames from the GiantMap game.
  • Figure 5: Representative frames from the HighRise game.
  • ...and 3 more figures
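
The self-supervised objective described in the Figure 3 caption can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The mean-squared-error criterion, the 0.75 masking ratio, the restriction of the loss to masked patches only (the caption says "mainly"), and the helper names `random_mask`, `masked_latent_loss`, and `toy_encoder` are all assumptions made for illustration. The toy encoder stands in for the real target and student networks.

```python
import random
from typing import List

Vector = List[float]


def random_mask(num_patches: int, mask_ratio: float, rng: random.Random) -> List[bool]:
    """Mark a fixed fraction of patches as masked (True). The ratio is an assumed value."""
    num_masked = int(round(num_patches * mask_ratio))
    masked_idx = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_idx for i in range(num_patches)]


def masked_latent_loss(
    student_latents: List[Vector],
    target_latents: List[Vector],
    mask: List[bool],
) -> float:
    """Mean squared error between student and target latents, averaged over masked patches.

    Assumption: only masked patches contribute; the paper may also weight visible ones.
    """
    total, count = 0.0, 0
    for s, t, is_masked in zip(student_latents, target_latents, mask):
        if not is_masked:
            continue
        total += sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
        count += 1
    return total / count if count else 0.0


def toy_encoder(patches: List[Vector], scale: float = 1.0) -> List[Vector]:
    """Hypothetical stand-in for an encoder: scales each patch element-wise."""
    return [[scale * x for x in p] for p in patches]


# Toy "frame" already split into 8 patches of 2 values each.
rng = random.Random(0)
patches = [[float(i), float(i + 1)] for i in range(8)]
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)

# The target encoder sees the original patches; the student sees masked (zeroed) ones.
masked_patches = [[0.0] * len(p) if m else p for p, m in zip(patches, mask)]
target_latents = toy_encoder(patches)
student_latents = toy_encoder(masked_patches)

loss = masked_latent_loss(student_latents, target_latents, mask)
print(f"masked patches: {sum(mask)}, latent reconstruction loss: {loss:.3f}")
```

In a real training loop the target latents would come from a frozen pretrained encoder (the TL;DR mentions MAE and DINOv1 targets), and only the student would receive gradients. The sketch shows only how the mask selects which latent positions are compared.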