Improving Code Search with Hard Negative Sampling Based on Fine-tuning

Hande Dong, Jiayi Lin, Yanlin Wang, Yichong Leng, Jiawei Chen, Yutao Xie

TL;DR

This paper introduces a cross-encoder architecture for code search that jointly encodes the concatenation of query and code, together with a Retriever-Ranker (RR) framework that cascades the dual-encoder and the cross-encoder to improve the efficiency of evaluation and online serving.

Abstract

Pre-trained code models have emerged as the state-of-the-art paradigm for code search tasks. The paradigm involves pre-training the model on search-irrelevant tasks such as masked language modeling, followed by a fine-tuning stage that focuses on the search-relevant task. The typical fine-tuning method employs a dual-encoder architecture that encodes semantic embeddings of the query and the code separately and then calculates their similarity from the embeddings. However, the dual-encoder architecture falls short in modeling token-level interactions between query and code, which limits the capabilities of the model. To address this limitation, we introduce a cross-encoder architecture for code search that jointly encodes the concatenation of query and code. We further introduce a Retriever-Ranker (RR) framework that cascades the dual-encoder and the cross-encoder to improve the efficiency of evaluation and online serving. Moreover, we present a ranking-based hard negative sampling (PS) method to improve the ability of the cross-encoder to distinguish hard negative codes, which further enhances the cascaded RR framework. Experiments on four datasets using three code models demonstrate the superiority of our proposed method. We have made the code available at https://github.com/DongHande/R2PS.
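The cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity for the dual-encoder stage, and the `cross_score` callable standing in for a fine-tuned cross-encoder are all assumptions made for illustration. The point it shows is that the cheap dual-encoder similarity narrows the corpus to $k$ candidates, and only those $k$ are scored by the more expensive cross-encoder.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embeddings (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_then_rank(
    query: str,
    query_vec: Vector,
    codes: List[str],
    code_vecs: List[Vector],
    cross_score: Callable[[str, str], float],  # hypothetical cross-encoder scorer
    k: int,
) -> List[Tuple[str, float]]:
    """Return the top-k retrieved codes, re-ordered by the cross-encoder score."""
    # Stage 1 (retriever): dual-encoder similarity against precomputed code embeddings.
    sims = [cosine(query_vec, vec) for vec in code_vecs]
    top_k = sorted(range(len(codes)), key=lambda i: sims[i], reverse=True)[:k]

    # Stage 2 (ranker): the cross-encoder scores only the k retrieved candidates.
    reranked = [(codes[i], cross_score(query, codes[i])) for i in top_k]
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked
```

In this sketch the code embeddings are computed once and reused for every query, while the cross-encoder needs one forward pass per (query, code) pair, so the per-query cost of the second stage grows with $k$ rather than with the corpus size. A code snippet that the retriever drops from the top $k$ can never be recovered by the ranker, which is why the choice of $k$ trades accuracy against response time.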

Paper Structure (28 sections, 9 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: The dual-encoder architecture and the cross-encoder architecture for code search. $s^q_1,s^q_2, \cdots,s^q_l$ denote the token sequence of query $q$, and $s^c_1,s^c_2, \cdots,s^c_m$ denote the token sequence of code $c$. (a) In the dual-encoder, the token sequences of the query and the code are input separately; (b) in the cross-encoder, the concatenation of the query and code token sequences is input.
  • Figure 2: An overview of R2PS for code search.
  • Figure 3: An example of a query and its relevant code. Tokens shown in the same color match between the query and the code.
  • Figure 4: Performance and response time of UniXcoder patched with RR and with R2PS for different numbers of retrieved codes $k$. The lower dashed line shows the performance of UniXcoder without the RR or R2PS patch.
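
The input difference described in the Figure 1 caption can be sketched as follows. This is an illustrative assumption, not the paper's preprocessing: the special tokens `[CLS]` and `[SEP]` and their placement are hypothetical, and real code models use their own tokenizers and separator conventions.

```python
from typing import List, Sequence, Tuple

# Hypothetical special tokens; actual models define their own.
CLS, SEP = "[CLS]", "[SEP]"


def dual_encoder_inputs(
    query_tokens: Sequence[str], code_tokens: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Two separate sequences: the query and the code are encoded independently."""
    return [CLS, *query_tokens, SEP], [CLS, *code_tokens, SEP]


def cross_encoder_input(
    query_tokens: Sequence[str], code_tokens: Sequence[str]
) -> List[str]:
    """One concatenated sequence, so attention can relate query and code tokens."""
    return [CLS, *query_tokens, SEP, *code_tokens, SEP]
```

Under these assumptions, a query of $l$ tokens and a code of $m$ tokens yield two sequences of lengths $l+2$ and $m+2$ for the dual-encoder, and a single sequence of length $l+m+3$ for the cross-encoder.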