SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks

Sanket Mhatre, Yasharth Bajpai, Sumit Gulwani, Emerson Murphy-Hill, Gustavo Soares

TL;DR

This work addresses the underrepresentation of C#/.NET in software-engineering benchmarks by introducing SWE-Sharp-Bench, a reproducible repository-level benchmark with 150 C# tasks from 17 projects. It builds on the SWE-Bench methodology, including automated environment generation, multi-stage PR-based curation, and manual verification, to enable cross-language evaluation using SWE-Agent and OpenHands across OpenAI and Anthropic models. The results reveal a substantial performance gap between Python and C#/Java, with C# tasks being more challenging due to patch breadth, multi-file edits, and language-specific tooling complexity; model choice also strongly drives success. The open-source curation pipeline and data enable ongoing benchmarking and analysis, providing a foundation for improving AI-assisted C# software engineering in enterprise contexts.

Abstract

AI coding agents have shown great progress on Python software engineering benchmarks such as SWE-Bench, and on other languages such as Java and C in benchmarks such as Multi-SWE-Bench. However, C#, a prominent enterprise language ranked #5 in the TIOBE index, remains absent from such benchmarks. We introduce SWE-Sharp-Bench, a reproducible software engineering benchmark for C# featuring 150 instances from 17 repositories. Evaluating identical model-agent configurations across languages reveals a significant performance gap: while 70% of Python tasks in SWE-Bench Verified are solved, only 40% of our C# tasks are resolved. We open-source SWE-Sharp-Bench and our entire curation pipeline.

Paper Structure

This paper contains 21 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Curation pipeline
  • Figure 2: Distribution of #Files, #Hunks and #Lines in Patches across Languages
  • Figure 3: Logistic regression coefficients and their influence on resolution rate (baseline: C#, GPT-4o). Significance: *** p < 0.001, ** p < 0.01, * p < 0.05.
  • Figure 4: SWE-Sharp-Bench Curation Process
  • Figure 5: Prompt Template used for Issue Type Categorization
  • ...and 3 more figures