DependEval: Benchmarking LLMs for Repository Dependency Understanding

Junjia Du; Yadi Liu; Hongcheng Guo; Jiawei Wang; Haojian Huang; Yunyi Ni; Zhoujun Li

TL;DR

DependEval is a hierarchical, multilingual benchmark that evaluates LLMs on repository-level understanding across eight programming languages and three tasks: Dependency Recognition, Repository Construction, and Multi-file Editing. It draws on 15,576 real-world GitHub repositories, builds inter-file snippets through dependency analysis, and uses ground-truth graphs together with human-in-the-loop curation to create gold standards. An evaluation of more than 25 models shows that model size and domain-specific pretraining both influence performance, that notable gaps remain in cross-file editing and directory-structure reasoning, and that instruction tuning benefits software-engineering tasks. The benchmark shows how current LLMs handle repository-level code reasoning and offers a framework that future work on repository-aware code understanding can extend. In practice, it can guide model development toward better tooling for large-scale codebases and serves as a resource for rigorous, real-world evaluation of repository-level capabilities.

Abstract

While large language models (LLMs) have shown considerable promise in code generation, real-world software development demands advanced repository-level reasoning. This includes understanding dependencies and project structures and managing multi-file changes. However, the ability of LLMs to comprehend and handle complex code repositories has yet to be fully explored. To address these challenges, we introduce DependEval, a hierarchical benchmark designed to evaluate repository dependency understanding. The benchmark is based on 15,576 repositories collected from real-world websites. It evaluates models on three core tasks: Dependency Recognition, Repository Construction, and Multi-file Editing, across 8 programming languages from actual code repositories. Our evaluation of over 25 LLMs reveals substantial performance gaps and provides valuable insights into repository-level code understanding.
