Code Representation Learning At Scale

Dejiao Zhang, Wasi Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang

TL;DR

CodeSage scales up code representation learning with a two-stage pretraining scheme. Stage I combines identifier deobfuscation with masked language modeling (MLM) that drops the 80-10-10 corruption convention in order to preserve code structure. Stage II applies bimodal contrastive learning with hard negatives and hard positives to align natural-language and code representations. Trained on The Stack across nine languages at three model sizes (130M, 356M, 1.3B), CodeSage outperforms existing models on Code2Code and NL2Code semantic search and performs strongly on code classification tasks, especially in cross-language settings. Ablation studies show that replacing every selected token with the mask token ("full mask") at a 15% masking rate beats 80-10-10, that hard negatives, hard positives, and bimodal contrastive learning each contribute to performance, and that the two-stage objective scales with model size whereas contrastive learning alone does not. Overall, the work demonstrates the practical benefit of large-scale, multi-objective pretraining for code representations and details which design choices drive effective code embedding learning.
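To make the difference between the two corruption schemes concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `corrupt`, the `[MASK]` string, the toy vocabulary, and the whitespace tokenization are all assumptions made for illustration, and identifier deobfuscation is not modeled. Under 80-10-10, a selected token becomes the mask token 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time; under "full mask", every selected token becomes the mask token.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption, not the paper's tokenizer)


def corrupt(tokens, vocab, mask_rate=0.15, scheme="full", seed=0):
    """Return (corrupted_tokens, target_positions) for one token sequence.

    scheme="full":     every selected token is replaced by MASK.
    scheme="80-10-10": a selected token becomes MASK (80%), a random
                       vocabulary token (10%), or is left unchanged (10%).
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    positions = []
    for i in range(len(tokens)):
        if rng.random() >= mask_rate:
            continue  # token not selected for prediction
        positions.append(i)
        if scheme == "full":
            corrupted[i] = MASK
        elif scheme == "80-10-10":
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # random replacement
            # otherwise: keep the original token as input
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return corrupted, positions


if __name__ == "__main__":
    code = "def add ( a , b ) : return a + b".split()
    for scheme in ("full", "80-10-10"):
        out, pos = corrupt(code, vocab=code, mask_rate=0.15, scheme=scheme, seed=3)
        print(scheme, pos, " ".join(out))
```

In this sketch, a random replacement under 80-10-10 can turn a syntactically valid snippet into an invalid one (for example, swapping `:` for `+`), which is one way to read the TL;DR's remark that full masking better preserves code structure.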

Abstract

Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, i.e., code generation. However, most existing work on code representation learning trains models at the hundred-million-parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders with a mixed objective that leverages both the randomness of masked language modeling and the structural aspects of programming languages. We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner. We establish an off-the-shelf encoder model that consistently outperforms existing models on a wide variety of downstream tasks by large margins. To understand the factors that contribute to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts cross-lingual semantic search performance; and (iv) how the pretraining scheme determines the way downstream task performance scales with model size.
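The contrastive stage described above can be illustrated with a small, dependency-free sketch. It assumes a batch of paired text and code embeddings and computes an InfoNCE-style loss with in-batch negatives. The hard-negative weighting shown here (up-weighting each negative by its similarity relative to the mean negative) is one simple illustrative choice and is not claimed to be the paper's exact formulation; the function names and the temperature value are likewise assumptions, and hard-positive construction is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(text_embs, code_embs, temperature=0.05, weight_hard=True):
    """Mean InfoNCE-style loss over (text_i, code_i) pairs.

    All other code embeddings in the batch serve as negatives for text_i.
    With weight_hard=True, negatives that are more similar to the anchor
    receive larger weights (illustrative assumption, see lead-in).
    """
    n = len(text_embs)
    total = 0.0
    for i in range(n):
        pos = math.exp(cosine(text_embs[i], code_embs[i]) / temperature)
        negs = [
            math.exp(cosine(text_embs[i], code_embs[k]) / temperature)
            for k in range(n)
            if k != i
        ]
        if weight_hard and negs:
            mean_neg = sum(negs) / len(negs)
            weights = [x / mean_neg for x in negs]  # harder => heavier
        else:
            weights = [1.0] * len(negs)
        denom = pos + sum(w * x for w, x in zip(weights, negs))
        total += -math.log(pos / denom)
    return total / n


if __name__ == "__main__":
    text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
    shuffled = aligned[1:] + aligned[:1]
    print("aligned :", contrastive_loss(text, aligned))
    print("shuffled:", contrastive_loss(text, shuffled))
```

With these weights, a batch whose negatives are all equally similar to the anchor reduces to the unweighted loss, while a single very similar negative dominates the denominator and is penalized more strongly.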

Paper Structure

This paper contains 42 sections, 3 equations, 9 figures, and 12 tables.

Figures (9)

  • Figure 1: An overview of the key ingredients of CodeSage for code representation learning.
  • Figure 2: 80-10-10 vs. "Full Mask".
  • Figure 3: (a) Hard negative and hard positive can independently boost performance over the baseline where neither is applied. Further improvement is attained when leveraging them simultaneously. (b) Unimodal contrastive learning with positives obtained via dropout requires longer training and hence cannot leverage vast amounts of training data to further enhance the representations.
  • Figure 4: Examining the effectiveness of contrastive learning (Stage-II) by comparing CodeSage against models trained with the token-level denoising objective only (Stage-I). (a) Compared to in-language Code2Code search, contrastive learning consistently leads to a larger performance boost for cross-lingual search, including both NL2Code and cross-language Code2Code search. (b) Contrastive learning leads to a more dispersed representation space with improved discrimination, as indicated by the enlarged similarity gap between parallel and randomly sampled pairs, while simultaneously narrowing the relative similarity gap between NL2Code and Code2Code pairs.
  • Figure 5: On the downstream task performance scaling with pretrained model size under different training schemes.
  • ...and 4 more figures