Table of Contents
Fetching ...

CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming

Ali TehraniJamsaz, Arijit Bhattacharjee, Le Chen, Nesreen K. Ahmed, Amir Yazdanbakhsh, Ali Jannesari

TL;DR

This paper introduces CodeRosetta, an encoder-decoder transformer model designed specifically for translating between programming languages and their HPC extensions and exhibits proficiency in Fortran to parallel C++ translation, marking it, to the authors' knowledge, as the first encoder-decoder model for this complex task.

Abstract

Recent advancements in Large Language Models (LLMs) have renewed interest in automatic programming language translation. Encoder-decoder transformer models, in particular, have shown promise in translating between different programming languages. However, translating between a language and its high-performance computing (HPC) extensions remains underexplored due to challenges such as complex parallel semantics. In this paper, we introduce CodeRosetta, an encoder-decoder transformer model designed specifically for translating between programming languages and their HPC extensions. CodeRosetta is evaluated on C++ to CUDA and Fortran to C++ translation tasks. It uses a customized learning framework with tailored pretraining and training objectives to effectively capture both code semantics and parallel structural nuances, enabling bidirectional translation. Our results show that CodeRosetta outperforms state-of-the-art baselines in C++ to CUDA translation by 2.9 BLEU and 1.72 CodeBLEU points while improving compilation accuracy by 6.05%. Compared to general closed-source LLMs, our method improves C++ to CUDA translation by 22.08 BLEU and 14.39 CodeBLEU, with 2.75% higher compilation accuracy. Finally, CodeRosetta exhibits proficiency in Fortran to parallel C++ translation, marking it, to our knowledge, as the first encoder-decoder model for this complex task, improving CodeBLEU by at least 4.63 points compared to closed-source and open-code LLMs.

CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming

TL;DR

This paper introduces CodeRosetta, an encoder-decoder transformer model designed specifically for translating between programming languages and their HPC extensions and exhibits proficiency in Fortran to parallel C++ translation, marking it, to the authors' knowledge, as the first encoder-decoder model for this complex task.

Abstract

Recent advancements in Large Language Models (LLMs) have renewed interest in automatic programming language translation. Encoder-decoder transformer models, in particular, have shown promise in translating between different programming languages. However, translating between a language and its high-performance computing (HPC) extensions remains underexplored due to challenges such as complex parallel semantics. In this paper, we introduce CodeRosetta, an encoder-decoder transformer model designed specifically for translating between programming languages and their HPC extensions. CodeRosetta is evaluated on C++ to CUDA and Fortran to C++ translation tasks. It uses a customized learning framework with tailored pretraining and training objectives to effectively capture both code semantics and parallel structural nuances, enabling bidirectional translation. Our results show that CodeRosetta outperforms state-of-the-art baselines in C++ to CUDA translation by 2.9 BLEU and 1.72 CodeBLEU points while improving compilation accuracy by 6.05%. Compared to general closed-source LLMs, our method improves C++ to CUDA translation by 22.08 BLEU and 14.39 CodeBLEU, with 2.75% higher compilation accuracy. Finally, CodeRosetta exhibits proficiency in Fortran to parallel C++ translation, marking it, to our knowledge, as the first encoder-decoder model for this complex task, improving CodeBLEU by at least 4.63 points compared to closed-source and open-code LLMs.

Paper Structure

This paper contains 37 sections, 17 figures, 10 tables.

Figures (17)

  • Figure 1: Masked Language Modeling (MLM) pretraining steps in CodeRosetta.
  • Figure 2: Abstract Syntax Tree Entity Recognition pretraining steps in CodeRosetta.
  • Figure 3: Denoising Auto Encoding.
  • Figure 4: Back Translation.
  • Figure 5: Prompt for determining the quality of C++ source code
  • ...and 12 more figures