Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent
Geert Heyman, Tom Van Cutsem
TL;DR
Annotated code search pairs code snippets with natural language descriptions to better capture the intent of the code. The authors develop a domain-specific retrieval framework with separate description-query and code-query embeddings, plus an ensemble that combines them. On three PACS benchmarks (CoNaLa, StaQC-py, SO-DS), the approach shows substantial gains over code-only baselines, with absolute improvements of up to 20.6% in mean reciprocal rank (MRR) and notable recall gains. The work highlights the value of descriptions for code search, demonstrates effective fine-tuning of the Universal Sentence Encoder for software-domain similarity, and shows that combining description and code signals yields the strongest performance. It also notes challenges such as code evolution and dataset quality.
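To make the idea of combining description and code signals concrete, here is a minimal sketch of one way such an ensemble could score snippets. It is an illustration under stated assumptions, not the authors' implementation: the weighted sum of cosine similarities, the `weight` parameter, the function names, and the toy two-dimensional vectors are all hypothetical. In practice the vectors would come from trained description-query and code-query embedding models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ensemble_score(q_desc, q_code, desc_vec, code_vec, weight=0.5):
    """Assumed combination rule: a convex mix of the two similarity signals.

    q_desc / q_code are the query embedded in the description-query and
    code-query spaces; desc_vec / code_vec are the snippet's embeddings.
    """
    return weight * cosine(q_desc, desc_vec) + (1.0 - weight) * cosine(q_code, code_vec)


def rank_snippets(q_desc, q_code, snippets, weight=0.5):
    """Return snippet ids sorted from highest to lowest ensemble score."""
    scored = [
        (ensemble_score(q_desc, q_code, s["desc"], s["code"], weight), s["id"])
        for s in snippets
    ]
    return [sid for _, sid in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Toy data: s1 matches the query on its description, s2 on its code.
snippets = [
    {"id": "s1", "desc": [1.0, 0.0], "code": [0.0, 1.0]},
    {"id": "s2", "desc": [0.0, 1.0], "code": [1.0, 0.0]},
]
description_heavy = rank_snippets([1.0, 0.0], [1.0, 0.0], snippets, weight=0.8)
code_heavy = rank_snippets([1.0, 0.0], [1.0, 0.0], snippets, weight=0.1)
```

With the toy data, weighting the description signal more heavily ranks `s1` first, while weighting the code signal more heavily ranks `s2` first. This shows how the mixing weight controls which signal dominates the ranking.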
Abstract
In this work, we propose and study annotated code search: the retrieval of code snippets paired with brief descriptions of their intent using natural language queries. On three benchmark datasets, we investigate how code retrieval systems can be improved by leveraging descriptions to better capture the intents of code snippets. Building on recent progress in transfer learning and natural language processing, we create a domain-specific retrieval model for code annotated with a natural language description. We find that our model yields significantly more relevant search results (with absolute gains up to 20.6% in mean reciprocal rank) compared to state-of-the-art code retrieval methods that do not use descriptions but attempt to compute the intent of snippets solely from unannotated code.
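For readers unfamiliar with the metric, the following sketch shows how mean reciprocal rank is commonly computed. Each query contributes the reciprocal of the rank of its first relevant result, or zero if none is retrieved. The function name and the toy queries are hypothetical, and the paper's exact evaluation protocol (for example, any rank cutoff) is not reproduced here.

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none is found)."""
    total = 0.0
    for results, relevant in zip(ranked_results, relevant_sets):
        for rank, item in enumerate(results, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results)


# Two toy queries: the first relevant snippet appears at rank 2 and rank 1.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"x"}]
mrr = mean_reciprocal_rank(ranked, relevant)  # (1/2 + 1/1) / 2 = 0.75
```

An absolute gain in MRR is the plain difference between two such scores. For instance, moving from 0.50 to 0.75 is an absolute gain of 25 percentage points.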
