SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation

Xin Wang, Yasheng Wang, Fei Mi, Pingyi Zhou, Yao Wan, Xiao Liu, Li Li, Hao Wu, Jin Liu, Xin Jiang

TL;DR

SynCoBERT tackles limitations of token-only code models by introducing syntax-aware and multi-modal pre-training. It jointly leverages symbolic identifiers, AST edges, and multi-modal data (code, comments, AST) through Identifier Prediction (IP), AST Edge Prediction (TEP), and a multi-modal contrastive learning (MCL) objective, all within a Transformer-based encoder. Pre-trained on CodeSearchNet, it achieves state-of-the-art performance across code search, clone detection, defect detection, and program translation, with ablations confirming the central role of MCL and the new objectives. The work demonstrates that integrating syntactic structure and cross-modal information yields more robust and transferable code representations for diverse downstream tasks.

Abstract

Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models for source code (e.g., CuBERT and CodeBERT) have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code search, code clone detection, and program translation. Current approaches typically treat source code as a plain sequence of tokens, or inject structural information (e.g., AST and data flow) into sequential model pre-training. To further explore the properties of programming languages, this paper proposes SynCoBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations. Specifically, we design two novel pre-training objectives originating from the symbolic and syntactic properties of source code, i.e., Identifier Prediction (IP) and AST Edge Prediction (TEP), which are designed to predict identifiers and edges between two nodes of the AST, respectively. Meanwhile, to exploit the complementary information in semantically equivalent modalities (i.e., code, comment, AST) of the code, we propose a multi-modal contrastive learning strategy to maximize the mutual information among different modalities. Extensive experiments on four downstream tasks related to code intelligence show that SynCoBERT advances the state of the art with the same pre-training corpus and model size.
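To make the two objectives named in the abstract more concrete, the following is a minimal sketch of how training targets for IP and TEP might be derived from a snippet. It assumes Python's standard `tokenize` and `ast` modules as a stand-in parser (the abstract does not say which parser is used), and the helper names `identifier_labels`, `ast_edges`, and `edge_label` are hypothetical, not taken from the paper.

```python
import ast
import io
import keyword
import tokenize

# Token types that carry no code content and are skipped in this sketch.
_SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}

def identifier_labels(source):
    """IP-style targets (assumed form): 1 if a token is an identifier, else 0."""
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        is_identifier = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        pairs.append((tok.string, 1 if is_identifier else 0))
    return pairs

def ast_edges(source):
    """Flatten the AST in pre-order and collect (parent, child) index pairs."""
    names, edges = [], []

    def visit(node, parent):
        idx = len(names)
        names.append(type(node).__name__)
        if parent is not None:
            edges.append((parent, idx))
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(ast.parse(source), None)
    return names, edges

def edge_label(edges, i, j):
    """TEP-style target (assumed form): 1 if nodes i and j share an AST edge."""
    return 1 if (i, j) in edges or (j, i) in edges else 0

snippet = "def add(a, b):\n    return a + b\n"
labels = identifier_labels(snippet)
names, edges = ast_edges(snippet)
```

Here `labels` marks `add`, `a`, and `b` as identifiers and keywords or punctuation as non-identifiers, while `edge_label(edges, i, j)` gives a binary target for any pair of node positions in the flattened AST. How the model consumes these targets is described in the paper body, not in this sketch.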

Paper Structure

This paper contains 33 sections, 7 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: A Python code snippet with its AST.
  • Figure 2: A part of the AST sequence obtained from the AST in Figure 1; blue arrows denote edges between nodes.
  • Figure 3: Different scenarios of SynCoBERT pre-training. (a) SynCoBERT takes source code paired with a comment and the corresponding AST as input, and is pre-trained with the MMLM, IP, and TEP objectives. (b) Positive sampling for NL-PL paired data: (left) NL vs PL-AST, (right) NL-PL-AST vs NL-AST-PL. (c) An illustration of positive and negative pairs, including in-batch and cross-batch negative sampling.
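
The positive and negative pairing described for Figure 3 can be illustrated with a generic InfoNCE-style contrastive loss. This is a sketch under stated assumptions, not the paper's exact formulation: it uses in-batch negatives only (no cross-batch negatives), cosine similarity, and a temperature of 0.1 chosen purely for illustration. The function names `cosine` and `info_nce` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; every other row of
    `positives` acts as an in-batch negative for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (assumed): two views of the same sample should align.
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(view_a, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(view_a, [[0.0, 1.0], [1.0, 0.0]])
```

With these toy vectors, `aligned` is much smaller than `swapped`, which is the behavior a contrastive objective rewards: representations of the same sample in different modalities are pulled together, and those of different samples are pushed apart.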