ComPile: A Large IR Dataset from Production Sources

Aiden Grossman, Ludger Paehler, Konstantinos Parasyris, Tal Ben-Nun, Jacob Hegna, William Moses, Jose M Monsalve Diaz, Mircea Trofin, Johannes Doerfert

TL;DR

ComPile tackles the scarcity of large-scale, production-grade intermediate representations by constructing a 2.8TB textual LLVM-IR dataset from Rust, Julia, Swift, and C/C++ using ecosystem-specific builders and LLVM-IR extraction prior to optimization. The authors implement a scalable workflow with deduplication and permissive license filtering, and perform extensive statistical analyses to demonstrate IR properties, duplication patterns, and tokenization behavior across languages. This resource enables pretraining and fine-tuning of IR-aware models and potentially learned compiler components, offering a pathway to improved code generation, optimization, and performance prediction within compiler infrastructure. By open-sourcing tooling and workflows, ComPile lowers the barrier to IR-centric machine learning research and fosters reproducible, scalable exploration of compiler-oriented ML methods.
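The filtering and deduplication steps mentioned above can be illustrated with a minimal sketch. It is not the ComPile tooling: the `Module` record, the `PERMISSIVE_LICENSES` allowlist, and the use of a SHA-256 digest over the textual IR are all assumptions made for illustration. Exact-text hashing is a deliberately simple stand-in; it only catches byte-identical modules.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical allowlist; the licenses actually accepted by ComPile are not stated here.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause"}


@dataclass(frozen=True)
class Module:
    """Hypothetical record for one extracted IR module."""
    package: str
    license: str  # SPDX-style identifier, assumed to be known per package
    ir_text: str  # textual LLVM IR of the module


def filter_and_deduplicate(modules):
    """Keep permissively licensed modules and drop exact textual duplicates."""
    seen_digests = set()
    kept = []
    for module in modules:
        if module.license not in PERMISSIVE_LICENSES:
            continue  # license filter
        digest = hashlib.sha256(module.ir_text.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue  # identical IR text was already kept
        seen_digests.add(digest)
        kept.append(module)
    return kept


if __name__ == "__main__":
    ir_f = "define void @f() {\n  ret void\n}\n"
    ir_g = "define void @g() {\n  ret void\n}\n"
    modules = [
        Module("pkg-a", "MIT", ir_f),
        Module("pkg-b", "Apache-2.0", ir_f),   # duplicate of pkg-a
        Module("pkg-c", "GPL-3.0-only", ir_g), # filtered out by license
    ]
    print([m.package for m in filter_and_deduplicate(modules)])
```

Keeping the first occurrence makes the result depend on input order, which is acceptable for a sketch but would need a deterministic ordering in a reproducible pipeline.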

Abstract

Code is increasingly becoming a core data modality of modern machine learning research, impacting not only the way we write code with conversational agents like OpenAI's ChatGPT, Google's Bard, or Anthropic's Claude and the way we translate code from one language into another, but also the compiler infrastructure underlying the language. While modeling approaches may vary and representations differ, the targeted tasks often remain the same within the individual classes of models. Relying solely on the ability of modern models to extract information from unstructured code fails to take advantage of 70 years of programming language and compiler development, because it does not use the structure inherent to programs during data collection. This detracts from the performance of models working over a tokenized representation of input code and precludes the use of these models in the compiler itself. To work towards the first intermediate representation (IR) based models, we fully utilize the LLVM compiler infrastructure, shared by a number of languages, to generate a 182B token dataset of LLVM IR. We generated this dataset from programming languages built on the shared LLVM infrastructure, including Rust, Swift, Julia, and C/C++, by hooking into LLVM code generation either through the language's package manager or the compiler directly to extract the dataset of intermediate representations from production-grade programs. Statistical analysis demonstrates the utility of our dataset not only for large language model training, but also for introspection into the code generation process itself, with the dataset showing great promise for machine-learned compiler components.

Paper Structure (24 sections, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Size distribution of LLVM intermediate representation (IR) bitcode within ComPile before de-duplication within and among languages.
  • Figure 2: Individual components of the dataset collection tooling. (Curated Sources) The set of sources, comprising package indices and selected packages, ingested by the ComPile Dataset Generation Pipeline. (Sources) acquires the source based on the provided package list before the (Builders) build the package; the result is then filtered and deduplicated, and its build process is documented, in (IR Extraction) to arrive at the dataset.
  • Figure 3: Histograms of 8 different function properties. All function properties are analyzed across all 5 languages, and show a similar left-skew in their count-statistics.
  • Figure 4: Percentage of duplicate functions present between two languages as determined by the newly upstreamed StructuralHash in LLVM with detailed hashing enabled. All values are percentages.
  • Figure 5: LLVM IR opcode distribution of the top ten operations across all languages included in ComPile as computed by LLVM's InstCount pass.
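
To make the opcode distribution in Figure 5 concrete, the following sketch tallies opcodes directly from textual LLVM IR. It is a rough, regex-based approximation written for illustration and is not LLVM's own instruction-counting pass; the function name `opcode_histogram` and the sample IR are assumptions. It relies on the usual printed layout in which instructions are indented inside a `define ... { ... }` body, and it does not handle quoted value names.

```python
import re
from collections import Counter

# An indented line, an optional "%result = " prefix, an optional tail-call
# marker, and then the opcode word (lowercase letters only).
_INSTRUCTION = re.compile(
    r"^\s+(?:%[\w.$-]+\s*=\s*)?(?:(?:tail|musttail|notail)\s+)?([a-z]+)\b"
)


def opcode_histogram(ir_text: str) -> Counter:
    """Count opcodes in function bodies of textual LLVM IR (heuristic)."""
    counts = Counter()
    in_body = False
    for line in ir_text.splitlines():
        if line.startswith("define "):
            in_body = True       # function header opens a body
        elif line.startswith("}"):
            in_body = False      # closing brace ends the body
        elif in_body:
            match = _INSTRUCTION.match(line)
            if match:            # labels and comments do not match
                counts[match.group(1)] += 1
    return counts


SAMPLE_IR = """\
define i32 @add_twice(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  %twice = add i32 %sum, %sum
  %cmp = icmp sgt i32 %twice, 0
  br i1 %cmp, label %pos, label %neg
pos:
  ret i32 %twice
neg:
  ret i32 0
}
"""

if __name__ == "__main__":
    for opcode, count in opcode_histogram(SAMPLE_IR).most_common():
        print(opcode, count)
```

For real measurements, a pass inside LLVM is the more reliable choice, since it works on the parsed IR rather than on its printed form.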