Explaining GitHub Actions Failures with Large Language Models: Challenges, Insights, and Limitations

Pablo Valenzuela-Toledo, Chuyue Wu, Sandro Hernandez, Alexander Boll, Roman Machacek, Sebastiano Panichella, Timo Kehrer

TL;DR

This study investigates whether large language models can generate useful explanations for GitHub Actions run failures drawn from unstructured logs. Using a mixed-methods feasibility approach with 31 developers and 10 failure logs, the authors evaluate correctness, conciseness, clarity, and actionability, finding strong support for simple cases but limitations for complex CI/CD scenarios. They identify one-shot prompting with specific models as effective and outline five dimensions of actionability, emphasizing the importance of log preprocessing and tailoring explanations to user expertise. The work offers practical guidance for integrating LLM-powered diagnostics into GA workflows and establishes a foundation for future improvements in LLM reasoning and adaptive explanations.

Abstract

GitHub Actions (GA) has become the de facto tool that developers use to automate software workflows, seamlessly building, testing, and deploying code. Yet when GA fails, it disrupts development, causing delays and driving up costs. Diagnosing failures is especially challenging because error logs are often long, complex, and unstructured. Given these difficulties, this study explores the potential of large language models (LLMs) to generate correct, clear, concise, and actionable contextual descriptions (or summaries) of GA failures, focusing on developers' perceptions of their feasibility and usefulness. Our results show that over 80% of developers rated LLM explanations positively in terms of correctness for simpler/small logs. Overall, our findings suggest that LLMs can feasibly assist developers in understanding common GA errors, thus potentially reducing manual analysis. However, we also found that improved reasoning abilities are needed to support more complex CI/CD scenarios. In addition, less experienced developers tend to rate the described context more positively, while seasoned developers prefer concise summaries. Our work offers key insights for researchers enhancing LLM reasoning, particularly in adapting explanations to user expertise.

Paper Structure

This paper contains 16 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Survey definition: Statements 1 through 10 are closed-ended, while questions 11 and 12 are open-ended (indicated with a star $\star$).
  • Figure 2: Partial view of the LogExp tool's interface. The log is displayed on the left, allowing participants to choose between viewing a summary or the full log. On the right, the corresponding textual explanation generated by the LLM is presented. Below these sections, participants encountered statements and questions specific to each case.
  • Figure 3: The stacked bar chart shows participants' levels of agreement with statements 1, 3, 4, 5, and 6.
  • Figure 4: The stacked bar chart shows participants' levels of agreement with statements 2, 7, 8, 9, and 10.