Top General Performance = Top Domain Performance? DomainCodeBench: A Multi-domain Code Generation Benchmark

Dewu Zheng, Yanlin Wang, Ensheng Shi, Xilin Liu, Yuchi Ma, Hongyu Zhang, Zibin Zheng

TL;DR

DomainCodeBench addresses the lack of domain-specific evaluation for code generation by introducing a multi-domain benchmark spanning 12 domains and 15 languages with 2,400 tasks and dependency-enriched prompts. The study reveals significant decoupling between general-domain and domain-specific performance, highlights common domain-specific failure modes, and demonstrates that domain-context augmentation can yield substantial improvements (≈38.17%). It also presents a rigorous construction methodology (domain mining, manual docstrings, dependency analysis) and evaluates repository-level augmentation strategies, offering practical guidance for practitioners and researchers. The replication package and open dataset aim to enable broader adoption and further advances in domain-aware code generation.

Abstract

With the rapid advancement of large language models (LLMs), extensive research has investigated their code generation capabilities. However, existing efforts primarily focus on general-domain tasks, leaving LLMs' code generation performance in real-world application domains underexplored. This raises a critical question: can a model's general-domain coding ability reliably represent its ability in specialized domains? In this paper, we introduce DomainCodeBench, a multi-domain code generation benchmark designed to systematically evaluate LLMs across 12 software application domains and 15 programming languages. DomainCodeBench contains 2,400 manually verified tasks with ground truth, human-annotated docstrings, and fine-grained dependency information to ensure broader coverage of domain-specific challenges. Specifically, we first identify the most popular application domains by topic mining. Then, we curate coding tasks based on commonly used frameworks and platforms in each domain. Extensive experiments on DomainCodeBench with ten mainstream LLMs yield several findings. (1) Performance decoupling: top general-domain models do not consistently excel in specific application domains; (2) Domain-specific weaknesses: LLMs often fail due to domain knowledge gaps and misuse of third-party libraries; (3) Contextual enhancement: augmenting prompts with domain-specific knowledge improves performance by around 38.17%, providing actionable insights for performance optimization. Our replication package, including the benchmark, source code, and experimental results, is available at https://github.com/DeepSoftwareAnalytics/DomainCodeBench.

Paper Structure

This paper contains 20 sections, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: An example task instance in DomainCodeBench.
  • Figure 2: DomainCodeBench construction pipeline.
  • Figure 3: Comparison of code generation performance in different domains for LLMs with similar performance on HumanEval.
  • Figure 4: Case studies.