MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks

Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

TL;DR

Results suggest that MonoCoder understands HPC code better than state-of-the-art LLMs and outperforms them on normalized-perplexity tests (perplexity relative to model size), while also delivering competitive CodeBLEU scores for high-performance and parallel code generation.

Abstract

With easier access to powerful compute resources, there is a growing trend in AI for software development toward building large language models (LLMs) that address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge and demand expensive compute resources for training. This is partly because LLMs for HPC tasks are obtained by fine-tuning existing LLMs that support several natural and/or programming languages. We found this design choice confusing: why do we need LLMs trained on natural languages and on programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we question the choices made by existing LLMs by developing smaller language models (LMs) for specific domains, which we call domain-specific LMs. We start with HPC as a domain and build an HPC-specific LM, named MonoCoder, which is orders of magnitude smaller than existing LMs but delivers better performance on non-HPC and HPC codes. We pre-trained MonoCoder on an HPC-specific dataset (named HPCorpus) of C and C++ programs mined from GitHub and evaluated it against state-of-the-art multilingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LMs, outperforms other LLMs on normalized-perplexity tests (perplexity relative to model size) while also delivering competitive CodeBLEU scores for high-performance and parallel code generation. In other words, the results suggest that MonoCoder understands HPC code better than state-of-the-art LLMs.

Paper Structure (9 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Local Semantics Elimination (LSE) pipeline overview: given source code, the code is converted into a semantics-free version using AST knowledge, and the resulting lexicalized tokens are fed into MonoCoder.
  • Figure 2: Comparison of code language models based on their model size, perplexities, and normalized-to-size perplexities for C and C++. The results demonstrate that smaller models, such as MonoCoder, tend to have much better normalized perplexity scores (lower is better), indicating better performance relative to their size. Data for PolyCoder, GPT-Neo, GPT-J, Codex, StarCoder, and GPT-NeoX are taken from xu2022systematic and li2023starcoder.
  • Figure 3: Evaluating HPC code completion by foundation models (PolyCoder example): the model generates code given different contexts taken from the initial HPC code, and the generated code is scored by its similarity to the true reference.
  • Figure 4: Code completion performance on general and OpenMP datasets. CodeBLEU scores (higher is better) for MonoCoder, PolyCoder, and GPT-3.5 models, both with and without Local Semantics Elimination (LSE), across varying context lengths (100, 300, and 600 tokens). MonoCoder and MonoCoder + LSE consistently outperform other models, with the addition of LSE generally enhancing performance across all models.
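
The Figure 1 caption describes LSE only at a high level, so the following is a rough Python sketch of the general idea rather than the authors' pipeline. It swaps the AST-based analysis for a simple regex tokenizer, and the placeholder scheme (`var_N`, `func_N`, `num_N`, `str_N`), the keyword list, and the function name `eliminate_local_semantics` are all assumptions made for illustration.

```python
import re

# Small, incomplete C keyword set; an assumption for this illustration only.
C_KEYWORDS = {
    "int", "float", "double", "char", "void", "long", "short", "unsigned",
    "signed", "const", "static", "struct", "return", "if", "else", "for",
    "while", "do", "break", "continue", "switch", "case", "default", "sizeof",
}

# Toy tokenizer: string literals, identifiers, numeric literals, a few
# multi-character operators, then any other single non-space character.
# A real pipeline would use a parser (AST) instead of a regex.
TOKEN_RE = re.compile(r'''
    (?P<str>"(?:\\.|[^"\\])*")
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<num>\d[\w.]*)
  | (?P<op>\+\+|--|->|&&|\|\||<<|>>|[-+*/%<>=!]=)
  | (?P<other>\S)
''', re.VERBOSE)


def eliminate_local_semantics(code):
    """Return a token list with names and literals replaced by placeholders."""
    names = {}
    counters = {"var": 0, "func": 0, "num": 0, "str": 0}

    def placeholder(kind, text):
        # The same (kind, text) pair always maps to the same placeholder.
        key = (kind, text)
        if key not in names:
            names[key] = f"{kind}_{counters[kind]}"
            counters[kind] += 1
        return names[key]

    tokens = []
    for match in TOKEN_RE.finditer(code):
        kind, text = match.lastgroup, match.group()
        if kind == "ident" and text not in C_KEYWORDS:
            # Crude heuristic: an identifier directly followed by "(" is
            # treated as a function name, anything else as a variable.
            is_call = code[match.end():].lstrip().startswith("(")
            tokens.append(placeholder("func" if is_call else "var", text))
        elif kind in ("num", "str"):
            tokens.append(placeholder(kind, text))
        else:
            tokens.append(text)  # keywords, operators, punctuation
    return tokens


if __name__ == "__main__":
    snippet = "for (int i = 0; i < n; i++) total += scale(a[i], 2);"
    print(" ".join(eliminate_local_semantics(snippet)))
```

The sketch keeps keywords, operators, and punctuation intact and replaces only programmer-chosen names and literals, so the structure of the code survives while local naming does not. It ignores comments, preprocessor directives, and many C/C++ token forms, which is one reason a parser-based approach like the one in the caption is preferable in practice.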