Data Augmentation for Code Translation with Comparable Corpora and Multiple References

Yiqing Xie, Atharva Naik, Daniel Fried, Carolyn Rose

TL;DR

Two data augmentation techniques are presented, one that builds comparable corpora (i.e., code pairs with similar functionality), and another that augments existing parallel data with multiple reference translations to reduce overfitting to a single reference translation.

Abstract

One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code pairs with similar functionality), and another that augments existing parallel data with multiple reference translations. Specifically, we build and analyze multiple types of comparable corpora, including programs generated from natural language documentation using a code generation model. Furthermore, to reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data and filter the translations by unit tests, which increases variation in target translations. Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy (CA@1), which verifies the correctness of translations by execution. The code is available at https://github.com/Veronicium/CMTrans.
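The unit-test filtering of generated references and the CA@1 metric described above can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). The helper names `passes_tests`, `filter_references`, and `ca_at_1`, the in-process `exec` harness, and the representation of a test as an (arguments, expected value) pair are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Hypothetical test format: (positional arguments, expected return value).
TestCase = Tuple[tuple, Any]


def passes_tests(source: str, func_name: str, tests: Sequence[TestCase]) -> bool:
    """Return True if `source` defines `func_name` and it passes every test.

    Assumption: candidates are Python source strings. A real harness would
    run each candidate in a sandboxed subprocess with a timeout; this sketch
    executes in-process and does not guard against infinite loops.
    """
    namespace: Dict[str, Any] = {}
    try:
        exec(source, namespace)  # also raises on syntax errors
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


def filter_references(
    candidates: Sequence[str], func_name: str, tests: Sequence[TestCase]
) -> List[str]:
    """Keep distinct candidate translations that pass all unit tests."""
    kept: List[str] = []
    for source in candidates:
        if source not in kept and passes_tests(source, func_name, tests):
            kept.append(source)
    return kept


def ca_at_1(
    top1_predictions: Sequence[str],
    func_names: Sequence[str],
    test_suites: Sequence[Sequence[TestCase]],
) -> float:
    """Fraction of problems whose top-1 translation passes its unit tests."""
    if not top1_predictions:
        return 0.0
    passed = sum(
        passes_tests(pred, name, tests)
        for pred, name, tests in zip(top1_predictions, func_names, test_suites)
    )
    return passed / len(top1_predictions)


# Toy candidates for one target function (illustrative data only).
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return sum([a, b])\n",
    "def add(a, b):\n    return a - b\n",   # wrong behaviour
    "def add(a, b) return a + b\n",        # does not parse
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

references = filter_references(candidates, "add", tests)
print(len(references))  # number of references that survive filtering
```

In this sketch, the candidates that survive filtering would serve as additional target references for the same source program, and `ca_at_1` scores a model by executing only its top-ranked translation for each problem.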

Paper Structure

This paper contains 20 sections, 1 equation, 11 figures, 7 tables.

Figures (11)

  • Figure 1: The standard pipeline for code translation and the pipeline of CMTrans. The comparable corpora are both naturally occurring and model generated. We generate multiple references by our method.
  • Figure 2: An example of parallel and comparable data. Parallel examples are line-by-line aligned. Programs in a comparable example may have different algorithms and structures (e.g., global code vs. class in this case), but may still contain lines that can be matched, as highlighted in pink, blue, and yellow.
  • Figure 3: An example of CMTrans for Java-to-Python translation. We compare the pipeline of CMTrans to the standard pipeline of code translation (e.g., finetuned CodeT5) and the self-supervision-and-fine-tuning method of TransCoder-ST.
  • Figure 4: Translation results with different amounts of parallel data. We mark the relative gain of CMTrans over CodeT5.
  • Figure 5: Perplexity on the validation set during finetuning.
  • ...and 6 more figures