The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui

TL;DR

The Vault is a multilingual dataset of high-quality code-text pairs built from The Stack, totaling 43 million paired samples across ten languages. A cleaning pipeline of 13 rule-based filters and a neural, CodeBERT-based consistency classifier turns large-scale raw code into high-quality code-text pairs with rich docstring metadata. Empirical evaluations show that fine-tuning CodeLLMs on The Vault improves performance on code summarization, code search, and code generation benchmarks compared with CodeSearchNet and raw-stack baselines. The authors also release an open-source toolkit for extraction and filtering. The dataset's scale, language coverage, and metadata-rich annotations support better generalization, making it a valuable resource for code understanding and generation. The authors acknowledge limitations and outline plans for broader language support and larger models.

Abstract

We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We describe an extraction pipeline that combines rule-based and deep learning-based methods to ensure that samples contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks, including code generation, code search, and code summarization, show that Code Large Language Models fine-tuned on The Vault outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models.

Paper Structure (29 sections, 10 figures, 14 tables)

Figures (10)

  • Figure 1: The tree-sitter node structure. Classes ($1$) and functions ($3$) are extracted along with their corresponding docstring, which may be in the form of a block comment ($2$) or a line comment ($5$). The line comments ($5$) are extracted along with their preceding ($4a$) and succeeding ($4b$) code nodes for the inline dataset.
  • Figure 2: Pipeline to create datasets of code blocks with comments $D_{block}$, unimodal code $D_{unimodal}$, and code-text pairs $D_{paired}$ from raw source code.
  • Figure 3: Input representation and negative sample generation for code-docstring inconsistency detection.
  • Figure 4: Distribution and number of functions by the presence of docstrings. Functions with docstrings are further divided into two categories: functions removed by rule-based filters and functions in the final code-text dataset.
  • Figure 5: Distribution of code and docstring token lengths. The plot shows the lower to upper quartile values of the number of tokens in the data. The orange solid line $|$ indicates the median and the green triangle ▲ represents the mean.
  • ...and 5 more figures
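
The extraction step described in Figures 1 and 2 can be illustrated with a small sketch. This is not the authors' toolkit: it uses Python's standard `ast` module as a stand-in for tree-sitter, handles Python source only, and assumes for illustration that a function or class with a docstring goes to the paired set and one without goes to the unimodal set. The name `split_functions` and the record fields are hypothetical.

```python
import ast


def split_functions(source: str):
    """Split functions and classes into paired (with docstring) and unimodal sets.

    Stand-in for tree-sitter-based extraction; Python source only.
    """
    paired, unimodal = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)  # None when no docstring is present
            code = ast.get_source_segment(source, node)
            if doc:
                paired.append({"name": node.name, "docstring": doc, "code": code})
            else:
                unimodal.append({"name": node.name, "code": code})
    return paired, unimodal


# Hypothetical input: one documented and one undocumented function.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def sub(a, b):
    return a - b
'''

paired, unimodal = split_functions(SAMPLE)
print([p["name"] for p in paired], [u["name"] for u in unimodal])
```

A real pipeline would then pass the paired records through quality filters such as the rule-based and classifier-based steps mentioned in the TL;DR; this sketch stops at the split.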