Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab
TL;DR
This work reveals a substantial gap between LLM performance on synthetic benchmarks and on real-world class-level code generation. The study constructs RealClassEval, a benchmark with seen and unseen partitions, and evaluates seven diverse LLMs across docstring and RAG configurations. Class-level correctness plummets from ~84–89% on synthetic tasks to ~25–34% on real-world tasks, with minimal difference between seen and unseen data. Docstrings yield only modest, often non-significant improvements, whereas retrieval augmentation provides meaningful gains of 4–7%, particularly when documentation is incomplete, supporting an information gap hypothesis. Error analysis identifies AttributeError, TypeError, and AssertionError as the dominant failure modes, with distinct patterns between synthetic and real-world settings. RAG also substitutes errors: it reduces some error types but introduces dependency-related ones. The findings point to targeted improvements in object-oriented semantic understanding and context-aware retrieval, and they argue for more realistic, multi-faceted benchmarks to guide production-ready code-generation tools.
Abstract
Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks, yet real-world software development increasingly demands class-level implementations that integrate multiple methods, attributes, and dependencies within authentic project contexts. This gap between benchmark performance and practical utility raises critical questions about LLMs' readiness for production code assistance, particularly regarding their ability to generalize across familiar and novel codebases. We introduce a benchmark derived from real-world open-source repositories, comprising classes divided into seen and unseen partitions to evaluate generalization under practical conditions. We systematically examine how input specification completeness and retrieval-augmented generation affect class-level correctness across multiple state-of-the-art LLMs. Our evaluation reveals a substantial performance gap: while LLMs achieve 84 to 89% correctness on synthetic benchmarks, they attain only 25 to 34% on real-world class tasks, with minimal distinction between familiar and novel codebases. Comprehensive documentation provides marginal improvements (1 to 3%), whereas retrieval augmentation yields greater gains (4 to 7%) by supplying concrete implementation patterns. Error analysis identifies AttributeError, TypeError, and AssertionError as dominant failure modes, with distinct patterns between synthetic and real-world scenarios. These findings provide actionable insights for enhancing context modelling, documentation strategies, and retrieval integration in production code assistance tools.
