Zero-Shot Cross-Domain Code Search without Fine-Tuning
Keyu Liang, Zhongxin Liu, Chao Liu, Zhiyuan Wan, David Lo, Xiaohu Yang
TL;DR
CodeBridge tackles zero-shot cross-domain code search without fine-tuning by complementing direct query-code matching with two simpler schemas: query-comment matching and code-code matching. It uses zero-shot generation from large language models to produce comments and pseudo-code, embeds all entities, and fuses the three similarity signals with a sampling-based strategy, yielding significant improvements over state-of-the-art PLM-based methods and results competitive with fine-tuning-based baselines. Empirical results on SQL, Solidity, and CoSQA show high complementarity among the three schemas, with CodeBridge outperforming PLM-based baselines by approximately 21–25% in MRR and remaining robust to the choice of fusion weights, retrieval model, and LLM. The approach offers a practical, tuning-free path to cross-domain code search for domains that lack domain-specific labeled data.
Abstract
Code search aims to retrieve semantically relevant code snippets for natural language queries. While pre-trained language models (PLMs) have shown remarkable performance on this task, they struggle in cross-domain scenarios, often requiring costly fine-tuning or suffering performance drops in zero-shot settings. RAPID, which generates synthetic data for model fine-tuning, is currently the only effective method for zero-shot cross-domain code search. Despite its effectiveness, RAPID demands substantial computational resources for fine-tuning and must maintain a specialized model for each domain, underscoring the need for a zero-shot, fine-tuning-free approach to cross-domain code search. The key to tackling zero-shot cross-domain code search lies in bridging the gaps among domains. In this work, we propose to break the query-code matching process of code search into two simpler tasks: query-comment matching and code-code matching. Our empirical study reveals strong complementarity among the three matching schemas (query-code, query-comment, and code-code matching) in zero-shot cross-domain settings. Based on these findings, we propose CodeBridge, a zero-shot, fine-tuning-free approach for cross-domain code search. Specifically, CodeBridge uses Large Language Models (LLMs) to generate comments and pseudo-code, then combines query-code, query-comment, and code-code matching via PLM-based similarity scoring and sampling-based fusion. Experimental results show that our approach outperforms the state-of-the-art PLM-based code search approaches, i.e., CoCoSoDa and UniXcoder, by an average of 21.4% and 24.9% in MRR, respectively, across three datasets. Our approach also yields results that are better than or comparable to those of the zero-shot cross-domain code search approach RAPID, which requires costly fine-tuning.
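To make the three-schema idea concrete, here is a minimal sketch of how the three similarity signals could be combined and scored with MRR. It rests on several assumptions: the embeddings are toy 2-D vectors standing in for PLM outputs, the comment and pseudo-code vectors stand in for embeddings of LLM-generated text, and the fusion is a plain weighted sum with hypothetical equal default weights rather than the sampling-based fusion named in the abstract, whose details are not given here. All function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fused_scores(query_vec, pseudo_code_vec, code_vecs, comment_vecs,
                 weights=(1.0, 1.0, 1.0)):
    """Score every candidate with a weighted sum of three similarities.

    weights = (query-code, query-comment, code-code). The equal default
    weights are an assumption made for illustration only.
    """
    w_qc, w_qm, w_cc = weights
    scores = []
    for code_vec, comment_vec in zip(code_vecs, comment_vecs):
        score = (
            w_qc * cosine(query_vec, code_vec)            # query vs. code
            + w_qm * cosine(query_vec, comment_vec)       # query vs. generated comment
            + w_cc * cosine(pseudo_code_vec, code_vec)    # generated pseudo-code vs. code
        )
        scores.append(score)
    return scores


def reciprocal_rank(scores, gold_index):
    """1 / rank of the gold candidate (rank 1 = highest score)."""
    gold_score = scores[gold_index]
    rank = 1 + sum(1 for s in scores if s > gold_score)
    return 1.0 / rank


def mean_reciprocal_rank(all_scores, gold_indices):
    """Average reciprocal rank over a set of queries."""
    ranks = [reciprocal_rank(s, g) for s, g in zip(all_scores, gold_indices)]
    return sum(ranks) / len(ranks)


# Toy data: candidate 1 is the correct snippet for the query.
query_vec = [1.0, 0.0]
pseudo_code_vec = [0.0, 1.0]             # embedding of LLM-generated pseudo-code
code_vecs = [[1.0, 0.0], [0.6, 0.8]]
comment_vecs = [[0.0, 1.0], [1.0, 0.0]]  # embeddings of LLM-generated comments

query_code_only = fused_scores(query_vec, pseudo_code_vec, code_vecs,
                               comment_vecs, weights=(1.0, 0.0, 0.0))
all_three = fused_scores(query_vec, pseudo_code_vec, code_vecs, comment_vecs)

rr_query_code_only = reciprocal_rank(query_code_only, gold_index=1)
rr_all_three = reciprocal_rank(all_three, gold_index=1)
```

In this toy case, query-code matching alone ranks the correct snippet second, while adding the query-comment and code-code signals moves it to first place. That is the kind of complementarity the abstract reports, although the toy numbers carry no empirical weight.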
