Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository

Ajinkya Deshpande, Anmol Agarwal, Shashank Shet, Arun Iyer, Aditya Kanade, Ramakrishna Bairi, Suresh Parthasarathy

TL;DR

This work introduces RepoClassBench, a benchmark for generating complete classes within real software repositories across Java, Python, and C#, emphasizing cross-file dependencies and test verification. It proposes Retrieve-Repotools-Reflect (RRR), an agent-based framework that augments LLMs with static analysis tools to iteratively explore repository context, guided by oracle test feedback. Empirical results show RRR significantly outperforms baselines that rely on raw prompts or simple retrieval across multiple languages and prompt granularities, underscoring the importance of repository-aware context and dependency-aware retrieval. The findings advocate for repo-level evaluation in code-generation benchmarks and demonstrate the practical value of tool-enabled reasoning for aligning generated code with complex software environments.

Abstract

LLMs have demonstrated significant potential in code generation tasks, achieving promising results at the function or statement level across various benchmarks. However, the complexities associated with creating code artifacts like classes, particularly within the context of real-world software repositories, remain underexplored. Prior research treats class-level generation as an isolated task, neglecting the intricate dependencies and interactions that characterize real-world software environments. To address this gap, we introduce RepoClassBench, a comprehensive benchmark designed to rigorously evaluate LLMs on generating complex, class-level code within real-world repositories. RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python, and C# from a selection of repositories. We ensure that each class in our dataset not only has cross-file dependencies within the repository but also includes corresponding test cases to verify its functionality. We find that current models struggle with the realistic challenges posed by our benchmark, primarily due to their limited exposure to relevant repository contexts. To address this shortcoming, we introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate and reason about repository-level context in an agent-based framework. Our experiments demonstrate that RRR significantly outperforms existing baselines on RepoClassBench, showcasing its effectiveness across programming languages and under various settings. Our findings emphasize the critical need for code-generation benchmarks to incorporate repo-level dependencies to more accurately reflect the complexities of software development. Our work shows the benefits of leveraging specialized tools to enhance LLMs' understanding of repository context. We plan to make our dataset and evaluation harness public.

Paper Structure

This paper contains 34 sections, 1 equation, 4 figures, 23 tables, and 1 algorithm.

Figures (4)

  • Figure 1: The dataset creation pipeline: shortlisting candidate repositories, recording which test cases pass, selecting classes that are covered by passing test cases and that make external references, and finally, where necessary, mitigating memorization issues by paraphrasing.
  • Figure 2: Flowchart illustrating the procedural framework of RRR. RRR utilizes the natural language description of the class and outputs of independent tools to create an initial attempt. This attempt is evaluated by an oracle, pinpointing specific errors. Subsequently, RRR uses repository tools to gather information to rectify the errors. It then reflects on feedback and tool insights to refine the attempt. This iterative cycle persists until either all test cases pass or the maximum allowed number of oracle calls is reached.
  • Figure 3: Distribution of the tasks across the various repositories in the Python dataset.
  • Figure 4: Distribution of the tasks across the various repositories in the Java dataset.
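
The iterative cycle described in the Figure 2 caption can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the names `rrr_loop`, `OracleResult`, `generate`, `oracle`, `gather`, and `refine`, as well as the default budget of 3 oracle calls, are hypothetical, and the toy stubs stand in for the LLM, the test oracle, and the repository tools.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class OracleResult:
    passed: bool    # True when every test case passes
    feedback: str   # error text from failing tests (empty on success)


def rrr_loop(
    description: str,
    independent_context: str,
    generate: Callable[[str, str], str],
    oracle: Callable[[str], OracleResult],
    gather: Callable[[str, str], str],
    refine: Callable[[str, str, str, str], str],
    max_oracle_calls: int = 3,  # assumed budget, not a value from the paper
) -> Tuple[str, bool, int]:
    """Return (final attempt, whether tests passed, oracle calls used)."""
    if max_oracle_calls < 1:
        raise ValueError("max_oracle_calls must be at least 1")

    # Initial attempt from the description plus independent tool outputs.
    attempt = generate(description, independent_context)
    calls = 0
    while True:
        # The oracle evaluates the attempt and pinpoints specific errors.
        result = oracle(attempt)
        calls += 1
        if result.passed or calls >= max_oracle_calls:
            return attempt, result.passed, calls
        # Use repository tools to gather information about the errors.
        tool_output = gather(attempt, result.feedback)
        # Reflect on feedback and tool insights to produce a refined attempt.
        attempt = refine(description, attempt, result.feedback, tool_output)


# Toy stand-ins (all hypothetical) so the sketch runs end to end.
def toy_generate(description: str, context: str) -> str:
    return "class Greeter:\n    pass\n"


def toy_oracle(attempt: str) -> OracleResult:
    if "def greet" in attempt:
        return OracleResult(True, "")
    return OracleResult(False, "AttributeError: no attribute 'greet'")


def toy_gather(attempt: str, feedback: str) -> str:
    return "hypothetical tool output: greet(self, name) -> str"


def toy_refine(description: str, attempt: str, feedback: str, tool_output: str) -> str:
    return "class Greeter:\n    def greet(self, name):\n        return 'hello ' + name\n"


final_attempt, passed, calls_used = rrr_loop(
    "A class that greets a user by name.", "",
    toy_generate, toy_oracle, toy_gather, toy_refine,
)
```

Passing the model, oracle, and tools in as callables keeps the control flow (evaluate, gather, reflect, refine, stop on success or exhausted budget) separate from any particular LLM or static analysis backend.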