CodeCSE: A Simple Multilingual Model for Code and Comment Sentence Embeddings
Anthony Varkey, Siyuan Jiang, Weijing Huang
TL;DR
CodeCSE introduces a multilingual, contrastively trained encoder that maps code functions and their descriptions into a shared sentence-embedding space, enabling zero-shot code search without language-specific fine-tuning. Built on GraphCodeBERT with data-flow masking, CodeCSE demonstrates competitive, sometimes superior, zero-shot code-search performance against language-tuned baselines and other contrastive models, and is openly released for reproducibility. The work includes extensive ablations on pooling, MLP depth, and loss functions, and analyzes language-dependent effects during pretraining and evaluation. Overall, CodeCSE advances practical code-comment embeddings and suggests paths for scaling with larger corpora and self-supervised enhancements.
Abstract
Pretrained language models that produce code token embeddings are used in code search, code clone detection, and other code-related tasks. Function-level code embeddings are similarly useful for these tasks, yet the current literature offers no out-of-the-box model for function embeddings. This paper therefore proposes CodeCSE, a contrastive learning model that learns embeddings for functions and their descriptions in a single shared space. We evaluated CodeCSE on code search. CodeCSE's multilingual zero-shot approach is as effective as models fine-tuned from GraphCodeBERT for specific languages. CodeCSE is open source at https://github.com/emu-se/codecse and the pretrained model is available on the Hugging Face public hub: https://huggingface.co/sjiang1/codecse
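To make the idea of a shared embedding space concrete, the following is a minimal sketch rather than the CodeCSE implementation. It assumes an InfoNCE-style contrastive objective with in-batch negatives, which is a common choice for this kind of training but is not specified in the abstract. The function names (`info_nce`, `search`), the temperature of 0.05, and the three-dimensional toy vectors are all illustrative assumptions; in practice the vectors would come from the encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def info_nce(code_embs, text_embs, temperature=0.05):
    """InfoNCE-style loss (assumed objective): description i should be
    closer to function i than to any other function in the batch."""
    losses = []
    for i, text in enumerate(text_embs):
        logits = [cosine(text, code) / temperature for code in code_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

def search(query_emb, code_embs):
    """Zero-shot code search: rank function indices by cosine similarity."""
    return sorted(
        range(len(code_embs)),
        key=lambda j: cosine(query_emb, code_embs[j]),
        reverse=True,
    )

# Toy embeddings (hypothetical values standing in for encoder outputs).
code_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.2, 0.8]]

ranking = search(text_embs[1], code_embs)
aligned_loss = info_nce(code_embs, text_embs)
# Reversing the functions mismatches two of the three pairs.
shuffled_loss = info_nce(code_embs[::-1], text_embs)
print(ranking, aligned_loss < shuffled_loss)
```

Because both modalities live in one space, search reduces to a nearest-neighbor lookup over precomputed function embeddings, and no language-specific fine-tuning step is needed at query time.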
