On the effectiveness of Large Language Models for GitHub Workflows

Xinyu Zhang; Siddharth Muralee; Sourag Cherupattamoolayil; Aravind Machiry

TL;DR

The paper presents the first large-scale, systematic evaluation of three state-of-the-art LLMs (GPT-3.5 Turbo, StarChat, CodeLlama) and their instruction-fine-tuned variants on five GitHub workflow tasks, including generation, defect detection, and defect repair, using a dataset of $\sim 4\times 10^5$ workflows. Using calibrated prompts and varied temperatures, the study finds that detailed prompts improve generation realism (BLEU) but can increase syntactic and security defects, while fine-tuning helps some models but can harm performance on unseen repair tasks. Defect detection strengths depend on the model and the prompt: StarChat excels at syntactic-error detection, and GPT-3.5 variants lead in locating error lines and, when fine-tuned, in code-injection detection. Defect repair proves the most challenging: higher temperatures aid syntactic repairs, but the models have little success in fixing code-injection vulnerabilities, suggesting the need for new LLM-assisted repair approaches in CI/CD contexts. Overall, the work illuminates the practical limits of current LLMs for workflow generation, defect detection, and repair, and provides a rich, open dataset to guide future research in LLM-assisted GitHub workflows.

Abstract

GitHub workflows, or GitHub CI, is a popular continuous integration platform that enables developers to automate various software engineering tasks by specifying them as workflows, i.e., YAML files with a list of jobs. However, engineering valid workflows is tedious. Workflows are also prone to severe security issues, which can result in supply chain vulnerabilities. Recent advancements in Large Language Models (LLMs) have demonstrated their effectiveness in various software development tasks. However, GitHub workflows differ from regular programs in both structure and semantics. We perform the first comprehensive study to understand the effectiveness of LLMs on five workflow-related tasks with different levels of prompts. We curated a set of $\sim$400K workflows and generated prompts with varying detail. We also fine-tuned LLMs on GitHub workflow tasks. Our evaluation of three state-of-the-art LLMs and their fine-tuned variants revealed various interesting findings on the current effectiveness and drawbacks of LLMs.
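To make the terms "workflow" and "code injection" concrete, the sketch below embeds a small workflow and flags `run:` steps that interpolate event data directly into a shell command. It is an illustration only, not the detection method evaluated in the paper: the workflow text, the `UNTRUSTED_EXPR` pattern, and the `find_injection_candidates` helper are all hypothetical, and the check only covers single-line `run:` entries.

```python
import re

# Hypothetical workflow written for illustration (not taken from the paper's dataset).
# The first step pastes the issue title straight into a shell command; the second
# passes it through an environment variable instead.
WORKFLOW = """\
name: greet
on: issues
jobs:
  greet:
    runs-on: ubuntu-latest
    steps:
      - name: Echo issue title
        run: echo "${{ github.event.issue.title }}"
      - name: Safer variant
        env:
          TITLE: ${{ github.event.issue.title }}
        run: echo "$TITLE"
"""

# Assumed pattern: a ${{ ... }} expression that reads from the triggering event payload.
UNTRUSTED_EXPR = re.compile(r"\$\{\{\s*github\.event\.[^}]*\}\}")


def find_injection_candidates(workflow_text):
    """Return (line_number, line) pairs where a single-line `run:` interpolates event data."""
    hits = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        stripped = line.strip()
        # A step may start directly with "- run: ...", so drop the list marker first.
        if stripped.startswith("- "):
            stripped = stripped[2:].lstrip()
        if stripped.startswith("run:") and UNTRUSTED_EXPR.search(stripped):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    for number, line in find_injection_candidates(WORKFLOW):
        print(f"line {number}: {line}")
```

A line-based regular expression is a deliberately crude choice here; a realistic checker would parse the YAML, handle multi-line `run:` blocks, and track which expressions are attacker-controlled.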

Paper Structure (41 sections, 8 figures, 21 tables)

This paper contains 41 sections, 8 figures, 21 tables.

Introduction
Background and Related Work
GitHub workflows
Defects in workflows
LLM
Using LLMs for Automated Code Generation
Using LLMs for Automated Defect Detection
Using LLMs for Automated Program Repair (APR)
Study Design
LLM Selection
Dataset Collection
De-duplication and Filtering
Fine-Tuning Dataset
Instruction Fine-Tuning
Implementation Details
...and 26 more sections

Figures (8)

Figure 1: Overview of Our Study.
Figure 2: Final evaluation for workflow generation.
Figure 3: Final evaluation for code injection vulnerability detection.
Figure 4: Final evaluation for code injection vulnerability repair.
Figure 5: BLEU score (left) and Accuracy@K (right) against the size (in KB) of expected workflows.
...and 3 more figures
