DependEval: Benchmarking LLMs for Repository Dependency Understanding

Junjia Du, Yadi Liu, Hongcheng Guo, Jiawei Wang, Haojian Huang, Yunyi Ni, Zhoujun Li

TL;DR

DependEval introduces a hierarchical, multilingual benchmark to evaluate LLMs on repository-level understanding across eight languages and three tasks: Dependency Recognition, Repository Construction, and Multi-file Editing. It assembles 15,576 real-world GitHub repositories, constructs inter-file snippets via dependency analysis, and employs ground-truth graphs and human-in-the-loop curation to create gold standards. Through an extensive evaluation of over 25 models, the study reveals that model size and domain-specific pretraining influence performance, with notable gaps in cross-file editing and directory-structure reasoning, and highlights the benefits of instruction-tuning for software-engineering tasks. The benchmark provides insights into how current LLMs handle repository-level code reasoning and offers a framework for future extensions and improvements in repository-aware code understanding. The work has practical impact by guiding model development toward better tooling for large-scale codebases and by offering a resource for rigorous, real-world evaluation of repository intelligence.

Abstract

While large language models (LLMs) have shown considerable promise in code generation, real-world software development demands advanced repository-level reasoning. This includes understanding dependencies and project structures and managing multi-file changes. However, the ability of LLMs to effectively comprehend and handle complex code repositories has yet to be fully explored. To address these challenges, we introduce DependEval, a hierarchical benchmark designed to evaluate repository dependency understanding. The benchmark is based on 15,576 repositories collected from real-world websites. It evaluates models on three core tasks (Dependency Recognition, Repository Construction, and Multi-file Editing) across 8 programming languages drawn from actual code repositories. Our evaluation of over 25 LLMs reveals substantial performance gaps and provides valuable insights into repository-level code understanding.

Paper Structure

This paper contains 56 sections, 8 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of DependEval. It contains 3 tasks including Repository Construction, Dependency Recognition, and Multi-file Editing. The first task analyzes the project description to generate a structure showing the dependencies between files. The second task identifies the content of the files to determine the relationships between them. Finally, the third task modifies the code to add functionality for saving displayed images to disk.
  • Figure 2: Pipeline for data curation of DependEval. It consists of four steps: data crawling and filtering (Step 1), dependency snippet generation (Step 2), test sample generation for dependency recognition (Step 3), and evaluation using metrics like Exact Match and Graph Match F1 Score (Step 4).
  • Figure 3: Length Distribution of DependEval.
  • Figure 4: Results of different tasks in DependEval. The radar charts show the performance of various models across the Dependency Recognition (a), Repository Construction (b), and Multi-file Editing (c) tasks. Each line represents a different model, with performance measured for different programming languages.
  • Figure 5: Instruction-following performance covering 8 different languages.
  • ...and 2 more figures
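
The Figure 2 caption names Exact Match and a Graph Match F1 Score as evaluation metrics but this page does not define them. The sketch below shows one plausible reading, under stated assumptions: a dependency graph is represented as a set of directed (source file, target file) edges, Exact Match compares two ordered file lists for equality, and F1 is computed over edges. The function names `exact_match` and `edge_f1`, the edge representation, and the file names are hypothetical and are not taken from the paper, whose actual metric definitions may differ.

```python
from typing import Iterable, Sequence, Tuple

# Assumption: a dependency edge is (dependent_file, file_it_depends_on).
Edge = Tuple[str, str]


def exact_match(predicted: Sequence[str], gold: Sequence[str]) -> bool:
    """Return True only if both file orderings are identical, element by element."""
    return list(predicted) == list(gold)


def edge_f1(predicted: Iterable[Edge], gold: Iterable[Edge]) -> float:
    """Harmonic mean of edge precision and edge recall between two graphs."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # two empty graphs agree trivially
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0  # no shared edges (also covers one-sided empty graphs)
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical three-file repository used only for illustration.
gold_edges = {("main.py", "utils.py"), ("main.py", "config.py")}
pred_edges = {("main.py", "utils.py"), ("utils.py", "config.py")}
score = edge_f1(pred_edges, gold_edges)
```

In this toy case one of the two predicted edges is correct and one of the two gold edges is recovered, so precision and recall are both 0.5 and the edge-level F1 is 0.5, whereas Exact Match would give no partial credit for a near-miss ordering.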