DRS-OSS: LLM-Driven Diff Risk Scoring Tool for PR Risk Prediction

Ali Sayedsalehi, Peter C. Rigby, Audris Mockus

TL;DR

DRS-OSS is an open-source, end-to-end diff risk scoring system for large-scale open-source projects. It addresses the challenge of prioritizing thousands of PRs by predicting the likelihood that each diff introduces a defect. The system fine-tunes a Llama 3.1 8B sequence classifier on ApacheJIT with long-context inputs; 4-bit QLoRA makes it possible to train on 22k-token contexts with a single 20 GB GPU, and the resulting model achieves state-of-the-art performance (F1 0.641, ROC-AUC 0.895). DRS-OSS ships as a production-ready pipeline with an API, web UI, and GitHub App. In simulations, gating the riskiest 30% of commits prevents up to 86.4% of defect-inducing changes. The work includes a full replication package and practical deployment guidance, supporting real-world use and open-source reproducibility in diff-risk tooling.

Abstract

In large-scale open-source projects, hundreds of pull requests land daily, each a potential source of regressions. Diff Risk Scoring (DRS) estimates the likelihood that a diff will introduce a defect, enabling better review prioritization, test planning, and CI/CD gating. We present DRS-OSS, an open-source DRS system equipped with a public API, web UI, and GitHub plugin. DRS-OSS uses a fine-tuned Llama 3.1 8B sequence classifier trained on the ApacheJIT dataset, consuming long-context representations that combine commit messages, structured diffs, and change metrics. Through parameter-efficient adaptation, 4-bit QLoRA, and DeepSpeed ZeRO-3 CPU offloading, we train 22k-token contexts on a single 20 GB GPU. On the ApacheJIT benchmark, DRS-OSS achieves state-of-the-art performance (F1 = 0.64, ROC-AUC = 0.89). Simulations show that gating only the riskiest 30% of commits can prevent up to 86.4% of defect-inducing changes. The system integrates with developer workflows through an API gateway, a React dashboard, and a GitHub App that posts risk labels on pull requests. We release the full replication package, fine-tuning scripts, deployment artifacts, code, demo video, and public website.
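The gating claim above can be read as a recall-at-top-k measurement: rank changes by predicted risk, gate the top fraction, and count how many defect-inducing changes fall inside the gated set. The following is a minimal sketch of that calculation, not the authors' simulation code; the function name `recall_at_top_fraction` and the toy scores and labels are illustrative assumptions and do not come from the paper or from ApacheJIT.

```python
from typing import List


def recall_at_top_fraction(scores: List[float], labels: List[int], fraction: float) -> float:
    """Return the share of defect-inducing changes (label 1) that land in the
    top `fraction` of changes when ranked by descending risk score."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    total_defects = sum(labels)
    if total_defects == 0:
        return 0.0  # nothing to catch

    # Rank changes from riskiest to safest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    # Number of changes that would be gated; always gate at least one.
    k = max(1, round(fraction * len(ranked)))

    caught = sum(label for _, label in ranked[:k])
    return caught / total_defects


# Hypothetical toy data: ten changes, three of which are defect-inducing.
toy_scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

toy_recall = recall_at_top_fraction(toy_scores, toy_labels, 0.30)
print(f"Defects caught by gating the top 30%: {toy_recall:.1%}")
```

On this toy input, the three riskiest changes contain two of the three defects. The 86.4% figure in the abstract is the analogous quantity reported for the authors' model on their evaluation data, under whatever ranking and gating details their simulation uses.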

Paper Structure

This paper contains 30 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: End-to-end DRS-OSS architecture.
  • Figure 2: Commit structuring changes the unified diff into a simpler format that can be better understood by LLMs. It also reduces the number of input tokens in the sequence.
  • Figure 3: Demo result on the Hive OOM-fix commit: derived label (risky/safe) and confidence for the predicted label.
  • Figure 4: GitHub bot output on a simple PR with three commits with label and confidence for each commit in the PR.
  • Figure 5: DRS-OSS Web UI landing page with GitHub and manual analysis modes.
  • ...and 3 more figures