CoNCRA: A Convolutional Neural Network Code Retrieval Approach
Marcelo de Rezende Martins, Marco A. Gerosa
TL;DR
CoNCRA is a CNN-based semantic code search framework that learns a joint embedding for natural-language queries and code snippets. It combines Word2Vec skip-gram word embeddings with a CNN-based sentence encoder and a cosine-based hinge loss for ranking. On Python questions drawn from StaQC, it reaches state-of-the-art-level performance and outperforms the baselines in MRR and top-k retrieval, with strong top-3 placement and a solid top-1 rate. These results suggest the approach could improve code search for developers and may transfer to other programming domains. Future work includes exploring transfer learning, broader datasets, and transformer-based alternatives to further improve performance.
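To make the ranking objective concrete, the following is a minimal sketch of a cosine-based hinge loss for one (query, matching snippet, non-matching snippet) triple. It assumes the query and code embeddings have already been produced by the encoders. The function names and the margin value of 0.05 are illustrative assumptions, not details taken from the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_hinge_loss(query, positive, negative, margin=0.05):
    """Hinge loss on cosine similarities.

    The loss is zero once the matching snippet scores at least `margin`
    higher than the non-matching one. The default margin is an assumed
    value chosen for illustration only.
    """
    pos_score = cosine_similarity(query, positive)
    neg_score = cosine_similarity(query, negative)
    return max(0.0, margin - pos_score + neg_score)


# Toy 2-d embeddings: the query is aligned with the first snippet only.
query_vec = [1.0, 0.0]
matching_code = [1.0, 0.0]
unrelated_code = [0.0, 1.0]

well_ranked = cosine_hinge_loss(query_vec, matching_code, unrelated_code)
badly_ranked = cosine_hinge_loss(query_vec, unrelated_code, matching_code)
```

In this toy case the loss vanishes when the matching snippet is already ranked above the unrelated one, and it is positive when the two are swapped, which is the signal that pushes matching pairs together in the joint embedding space.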
Abstract
Software developers routinely search for code using general-purpose search engines. However, these search engines cannot find code semantically unless it has an accompanying description. We propose a technique for semantic code search: A Convolutional Neural Network approach to code retrieval (CoNCRA). Our technique aims to find the code snippet that most closely matches the developer's intent, expressed in natural language. We evaluated the efficacy of our approach on a dataset of questions and code snippets collected from Stack Overflow. Our preliminary results show that our technique, which prioritizes local interactions (nearby words), improved on the state of the art (SOTA) by 5% on average and ranked the most relevant code snippet among the top three positions almost 80% of the time. Our technique is therefore promising and can improve the efficacy of semantic code retrieval.
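As an illustration of how results such as "top three positions" and MRR are typically computed, the sketch below derives both metrics from the 1-based rank at which the correct snippet appears for each query. The helper names and the example ranks are hypothetical and are not taken from the paper's data.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries, where rank is 1-based."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def top_k_rate(ranks, k):
    """Fraction of queries whose correct snippet is ranked in the first k positions."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct snippet for four queries.
example_ranks = [1, 2, 5, 1]

mrr = mean_reciprocal_rank(example_ranks)
top1 = top_k_rate(example_ranks, 1)
top3 = top_k_rate(example_ranks, 3)
```

For these made-up ranks, three of the four queries have the correct snippet in the top three positions, so the top-3 rate is 0.75, and the MRR is 0.675.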
