Leveraging Large Language Models for Efficient Failure Analysis in Game Development

Leonardo Marini, Linus Gisslén, Alessandro Sestini

TL;DR

This work tackles the problem of identifying which code change caused a failing test when many changes are batched in large-scale game development. It presents a BERT-based pipeline that, given an error message and a set of candidate commits, scores each candidate and selects the most likely root cause, achieving $0.71$ accuracy on a real-world dataset. The method is integrated into an existing development framework and is evaluated through quantitative experiments and a user study, demonstrating up to $60\%$ time savings for developers and significant practical value. The study highlights both the feasibility and the impact of LLM-assisted failure analysis in AAA game pipelines, while outlining limitations and avenues for future enhancement with richer signals and non-deterministic failure handling.

Abstract

In games, and more generally in the field of software development, early detection of bugs is vital to maintain a high quality of the final product. Automated tests are a powerful tool that can catch a problem earlier in development by executing periodically. As an example, when new code is submitted to the code base, a new automated test verifies these changes. However, identifying the specific change responsible for a test failure becomes harder when dealing with batches of changes -- especially in the case of a large-scale project such as a AAA game, where thousands of people contribute to a single code base. This paper proposes a new approach to automatically identify which change in the code caused a test to fail. The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure. We investigate the effectiveness of our approach with quantitative and qualitative evaluations. Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year. We further evaluated our model through a user study to assess the utility and usability of the tool from a developer perspective, resulting in a significant reduction in time -- up to 60% -- spent investigating issues.

Paper Structure (13 sections, 2 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Mock-up of the tool showing the main existing features and the additional "Identify" button (in (a)). Each entry in the view represents a failed test. At the top of each entry, we can see the error message $e$. Below the message, the framework shows a list of IDs: each ID corresponds to a commit $c_i$. Once a user presses the "Identify" button, the model runs and the ID of the commit estimated to have caused the error is highlighted (i.e., number 134 in (b)). Once a user identifies the correct ID, they can claim it, and the view will display "Claimed by: ID".
  • Figure 2: Time saved by each participant. Panel (a) shows the percentage of time each participant saved compared to the previous manual approach. Panel (b) shows the time spent investigating an issue with the previous manual approach and with our framework. In both plots, the X-axis represents the participants (e.g., P1 means participant 1).
  • Figure 3: Two examples of error messages and their related commits. The error message is shown in red, and the commit that caused the error is shown in green. The model takes as input the pair $(e, c_i)$ for each $c_i$ and outputs a score indicating how likely it is that the commit caused the error message.
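
The scoring step described in the Figure 1 and Figure 3 captions (score every pair $(e, c_i)$, then highlight the highest-scoring commit) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the paper uses a BERT-based model, whereas the `token_overlap_score` function here is a hypothetical stand-in based on simple token overlap, and the function names, commit descriptions, and error message are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, Tuple


def token_overlap_score(error_message: str, commit_text: str) -> float:
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase tokens.

    This is NOT the BERT-based model from the paper; it only gives the
    sketch a runnable score function with the same (e, c_i) -> score shape.
    """
    def tokenize(text: str) -> set:
        return set(re.findall(r"[a-z0-9_]+", text.lower()))

    error_tokens = tokenize(error_message)
    commit_tokens = tokenize(commit_text)
    if not error_tokens or not commit_tokens:
        return 0.0
    return len(error_tokens & commit_tokens) / len(error_tokens | commit_tokens)


def identify_commit(
    error_message: str,
    commits: Dict[int, str],
    score_fn: Callable[[str, str], float] = token_overlap_score,
) -> Tuple[int, Dict[int, float]]:
    """Score each pair (e, c_i) and return the best commit ID with all scores."""
    if not commits:
        raise ValueError("at least one candidate commit is required")
    scores = {cid: score_fn(error_message, text) for cid, text in commits.items()}
    best_id = max(scores, key=scores.get)  # argmax over candidate commits
    return best_id, scores


# Made-up example data; the IDs and texts are illustrative only.
error = "NullReferenceException in PlayerInventory.AddItem"
candidates = {
    132: "Update lighting shader parameters",
    134: "Refactor PlayerInventory AddItem to support stacking",
    137: "Fix typo in localization strings",
}
best, all_scores = identify_commit(error, candidates)
print(best, all_scores)
```

In a setup closer to the one the paper describes, `score_fn` would presumably wrap a fine-tuned language model that takes the error message and the commit as a paired input. The argmax selection over the candidate batch would stay the same.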