CodeCSE: A Simple Multilingual Model for Code and Comment Sentence Embeddings

Anthony Varkey, Siyuan Jiang, Weijing Huang

TL;DR

CodeCSE introduces a multilingual, contrastively trained encoder that maps code functions and their descriptions into a shared sentence-embedding space, enabling zero-shot code search without language-specific fine-tuning. Built on GraphCodeBERT with data-flow masking, CodeCSE demonstrates competitive, sometimes superior, zero-shot code-search performance against language-tuned baselines and other contrastive models, and is openly released for reproducibility. The work includes extensive ablations on pooling, MLP depth, and loss functions, and analyzes language-dependent effects during pretraining and evaluation. Overall, CodeCSE advances practical code-comment embeddings and suggests paths for scaling with larger corpora and self-supervised enhancements.

Abstract

Pretrained language models for code token embeddings are used in code search, code clone detection, and other code-related tasks. Similarly, code function embeddings are useful in such tasks. However, there are no out-of-the-box models for function embeddings in the current literature. This paper therefore proposes CodeCSE, a contrastive learning model that learns embeddings for functions and their descriptions in one space. We evaluated CodeCSE using code search. CodeCSE's multilingual zero-shot approach is as efficient as the models fine-tuned from GraphCodeBERT for specific languages. CodeCSE is open source at https://github.com/emu-se/codecse and the pretrained model is available at the HuggingFace public hub: https://huggingface.co/sjiang1/codecse

Paper Structure (38 sections, 6 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Overview of contrastive learning.
  • Figure 2: CodeCSE's dual-encoder model architecture. Source code and NL documents are encoded independently by an encoder. The encoder for source code and the encoder for NL documents share parameters. The encoder outputs are fed into contrastive learning. In contrastive learning, the outputs are pooled into $c$ and $d$, which are the sentence embeddings for source code and the NL document respectively.
  • Figure 3: CodeCSE's contrastive learning after pooling. The contrastive learning module takes a minibatch of pairs of code and document embeddings as inputs. $n$ is the size of the minibatch. ($c_i$, $d_j$) is a pair of sentence embeddings of a piece of code and a document. CodeCSE calculates the similarity of ($c_i$, $d_j$) for every $i$ and $j$ in $[1, \ldots, n]$. If $i$ equals $j$, the pair is a positive sample. Otherwise, the pair is a negative sample. The training loss is calculated based on the similarities.
  • Figure 4: Alignment scores in the ablation studies. Rectangles denote $symmetric$ loss, triangles denote $asymmetric\_doc$, and circles denote $asymmetric\_code$.
  • Figure 5: A Ruby function in the validation set for the code search task. Its corresponding query is "Extract the query parameters and append them to the url". The comments in the function are removed in the preprocessing step.
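
The in-batch scheme described in the Figure 3 caption can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the captions say the loss is "calculated based on the similarities" without giving a formula, so the cosine similarity, the softmax cross-entropy form, and the `temperature` value below are all assumptions, as are the function names. The direction labels `code_to_doc` and `doc_to_code` are hypothetical and are not mapped onto the paper's $asymmetric\_code$ and $asymmetric\_doc$ variants, since the captions do not define them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(code_embs, doc_embs):
    """sim[i][j] is the similarity of (c_i, d_j); the diagonal holds positives."""
    return [[cosine(c, d) for d in doc_embs] for c in code_embs]


def row_cross_entropy(sim, temperature):
    """Mean over rows i of -log softmax(sim[i] / temperature)[i].

    Row i treats entry i as the positive sample and every other entry in
    the row as an in-batch negative.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[i]
    return total / len(sim)


def contrastive_loss(code_embs, doc_embs, temperature=0.05, mode="symmetric"):
    """In-batch contrastive loss over n (code, document) embedding pairs.

    mode is one of "symmetric", "code_to_doc", "doc_to_code"; these labels
    are hypothetical names chosen for this sketch.
    """
    sim = similarity_matrix(code_embs, doc_embs)
    sim_t = [list(col) for col in zip(*sim)]  # transpose: documents as rows
    code_to_doc = row_cross_entropy(sim, temperature)
    doc_to_code = row_cross_entropy(sim_t, temperature)
    if mode == "code_to_doc":
        return code_to_doc
    if mode == "doc_to_code":
        return doc_to_code
    if mode == "symmetric":
        return 0.5 * (code_to_doc + doc_to_code)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    code = [[1.0, 0.0], [0.0, 1.0]]
    docs_aligned = [[1.0, 0.0], [0.0, 1.0]]
    docs_shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_loss(code, docs_aligned))
    print(contrastive_loss(code, docs_shuffled))
```

With the toy inputs in the `__main__` block, the aligned batch yields a much smaller loss than the shuffled one, which is the behavior a contrastive objective is meant to reward: matching pairs score high on the diagonal and mismatched pairs score low elsewhere.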