SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks
Sanket Mhatre, Yasharth Bajpai, Sumit Gulwani, Emerson Murphy-Hill, Gustavo Soares
TL;DR
This work addresses the underrepresentation of C#/.NET in software-engineering benchmarks by introducing SWE-Sharp-Bench, a reproducible repository-level benchmark with 150 C# tasks from 17 projects. It builds on the SWE-Bench methodology, including automated environment generation, multi-stage PR-based curation, and manual verification, to enable cross-language evaluation using SWE-Agent and OpenHands across OpenAI and Anthropic models. The results reveal a substantial performance gap between Python and C#/Java, with C# tasks being more challenging due to patch breadth, multi-file edits, and language-specific tooling complexity; model choice also strongly drives success. The open-source curation pipeline and data enable ongoing benchmarking and analysis, providing a foundation for improving AI-assisted C# software engineering in enterprise contexts.
Abstract
AI coding agents have shown great progress on Python software engineering benchmarks such as SWE-Bench, and on other languages such as Java and C in benchmarks such as Multi-SWE-Bench. However, C# -- a prominent enterprise language ranking #5 in the TIOBE index -- remains absent from such benchmarks. We introduce SWE-Sharp-Bench, a reproducible software engineering benchmark for C# featuring 150 instances from 17 repositories. Evaluating identical model-agent configurations across languages reveals a significant performance gap: while 70% of Python tasks in SWE-Bench Verified are resolved, only 40% of our C# tasks are resolved. We open-source SWE-Sharp-Bench and our entire curation pipeline.
