M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai, Yanan Wu, Ke Jin, Ge Zhang, Zekun Wang, Guoan Zhang, Bangyu Xiang, Wenbo Su, Bo Zheng

TL;DR

This work addresses two limitations of existing repository-level code completion benchmarks, narrow language coverage and coarse averaged scores, by introducing M2rc-Eval, a massively multilingual benchmark spanning 18 languages with AST-derived bucket-level and semantic-level annotations, plus M2rc-Instruct for multilingual instruction tuning. It builds its data from The Stack v2 through a large-scale pipeline with quality controls, and provides detailed annotation schemas that enable fine-grained analysis of code LLMs across languages and semantic categories. In experiments with retrieval-based context augmentation and instruction fine-tuning, the study shows that cross-file context significantly improves performance, that multilingual instruction data can boost even smaller models, and that Python-only tuning generalizes well to other languages. Together, M2rc-Eval and M2rc-Instruct offer a framework for evaluating and improving repository-level code intelligence in a multilingual setting.

Abstract

Repository-level code completion has drawn great attention in software engineering, and several benchmark datasets have been introduced. However, existing repository-level code completion benchmarks usually focus on a limited number of languages (<5), and therefore cannot evaluate the general code intelligence abilities of existing code Large Language Models (LLMs) across different languages. Moreover, existing benchmarks usually report scores averaged over languages, ignoring fine-grained abilities in different completion scenarios. Therefore, to facilitate research on code LLMs in multilingual scenarios, we propose M2RC-EVAL, a massively multilingual repository-level code completion benchmark covering 18 programming languages. It provides two types of fine-grained annotations (bucket-level and semantic-level) for different completion scenarios, both derived from the parsed abstract syntax tree. We also curate M2RC-INSTRUCT, a massively multilingual instruction corpus, to improve the repository-level code completion abilities of existing code LLMs. Comprehensive experimental results demonstrate the effectiveness of M2RC-EVAL and M2RC-INSTRUCT.

Paper Structure

This paper contains 21 sections, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Overview of our proposed M$^2$rc-Eval with 18 languages. First, we provide three samples from different languages (i.e., Python, Java, TypeScript) for illustration, each with the bucket label and semantic label for the corresponding cursor position. Second, the code LLMs need to predict the completion results given the in-file context from the current code file and the cross-file context retrieved from other code files in the current repository. Note that "$<\mathrm{INFILLING}>$" marks the position at which code completion is triggered.
  • Figure 2: Illustration of how the completion cursor position and fine-grained annotations are generated. We first parse the source code into an abstract syntax tree (AST). Then, we choose one node as the completion cursor position, generate the bucket label from the layer of the AST to which the node belongs, and obtain the semantic label from the node type parsed by Tree-sitter.
  • Figure 3: The average prompt length (100x tokens), completion span length (50x tokens), and cross-file dependencies (1x) in the testing set of M$^2$rc-Eval. We define the number of other files, which are explicitly imported and implicitly referenced by the current file, as cross-file dependencies.
  • Figure 4: Semantic-level annotations on different types of programming languages.
  • Figure 5: Effectiveness of using different training data sizes.
  • ...and 11 more figures
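
The annotation procedure described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it uses Python's standard `ast` module instead of Tree-sitter, it handles only Python source, and the function names (`annotate_nodes`, `split_at`, `make_sample`), the `num_buckets` parameter, and the depth-to-bucket mapping are all assumptions made for illustration. The sketch parses a file, records each node's depth, maps depth to a bucket label, uses the node type as the semantic label, and replaces one chosen node with an `<INFILLING>` marker.

```python
import ast
import random


def annotate_nodes(source: str, num_buckets: int = 10) -> list:
    """Label every AST node that has a source span.

    Assumption: depth is mapped linearly onto `num_buckets` buckets;
    the paper's exact bucketing rule is not given in this excerpt.
    """
    tree = ast.parse(source)
    records = []

    def walk(node, depth):
        # Only nodes with position info can serve as a completion span.
        if hasattr(node, "lineno") and hasattr(node, "end_lineno"):
            records.append((node, depth))
        for child in ast.iter_child_nodes(node):
            walk(child, depth + 1)

    walk(tree, 0)
    if not records:
        return []
    max_depth = max(depth for _, depth in records)
    return [
        {
            "node": node,
            "depth": depth,
            "bucket": depth * num_buckets // (max_depth + 1),
            "semantic": type(node).__name__,  # stand-in for a Tree-sitter node type
        }
        for node, depth in records
    ]


def split_at(source: str, node: ast.AST) -> tuple:
    """Split source into (prefix, target, suffix) around a node.

    Assumption: the source is ASCII, so the UTF-8 byte offsets that
    `ast` reports equal character offsets.
    """
    lines = source.splitlines(keepends=True)
    start = sum(len(l) for l in lines[: node.lineno - 1]) + node.col_offset
    end = sum(len(l) for l in lines[: node.end_lineno - 1]) + node.end_col_offset
    return source[:start], source[start:end], source[end:]


def make_sample(source: str, seed: int = 0, num_buckets: int = 10) -> dict:
    """Pick one node as the cursor position and build an infilling sample."""
    annotated = annotate_nodes(source, num_buckets)
    choice = random.Random(seed).choice(annotated)
    prefix, target, suffix = split_at(source, choice["node"])
    return {
        "prompt": prefix + "<INFILLING>" + suffix,
        "target": target,
        "bucket": choice["bucket"],
        "semantic": choice["semantic"],
    }


if __name__ == "__main__":
    example = "def add(a, b):\n    return a + b\n"
    print(make_sample(example, seed=0, num_buckets=5))
```

A shallow node such as a function definition yields a long completion span and a low bucket label, while a deep node such as a single identifier yields a short span and a high bucket label, which is what makes per-bucket reporting informative.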