CoNCRA: A Convolutional Neural Network Code Retrieval Approach

Marcelo de Rezende Martins, Marco A. Gerosa

TL;DR

CoNCRA is a CNN-based semantic code search framework that learns a joint embedding for natural-language queries and code snippets. It combines Word2Vec skip-gram word embeddings with a CNN-based sentence encoder and a cosine-based hinge loss for ranking. On StaQC-derived Python questions, it outperforms the baselines in MRR and top-k retrieval, placing the most relevant snippet in the top 3 positions almost 80% of the time and achieving a solid TOP-1 rate. These results suggest better search experiences for developers and possible transfer to other coding domains. Future work includes transfer learning, broader datasets, and transformer-based alternatives.
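
As a rough illustration of the cosine-based hinge loss mentioned above, the following Python sketch computes a margin ranking loss for one question embedding, one matching snippet embedding, and one non-matching snippet embedding. The function names, the toy vectors, and the margin value of 0.05 are assumptions made for illustration; this summary does not state the exact formulation or hyperparameters used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hinge_ranking_loss(question, positive_code, negative_code, margin=0.05):
    """Margin ranking loss on cosine similarities.

    The loss is zero once the matching snippet is more similar to the
    question than the non-matching one by at least `margin`.
    The default margin is an assumed value, not taken from the paper.
    """
    pos_sim = cosine(question, positive_code)
    neg_sim = cosine(question, negative_code)
    return max(0.0, margin - pos_sim + neg_sim)


# Toy embeddings (hypothetical): the question points the same way as the
# matching snippet and is orthogonal to the non-matching one.
question = [1.0, 0.0]
matching = [1.0, 0.0]
non_matching = [0.0, 1.0]

well_ranked = hinge_ranking_loss(question, matching, non_matching)
badly_ranked = hinge_ranking_loss(question, non_matching, matching)
print(well_ranked, badly_ranked)
```

Minimizing a loss of this kind pushes a question closer to its annotated snippet than to a sampled wrong snippet, which is what allows retrieval by nearest neighbor in the joint space.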

Abstract

Software developers routinely search for code using general-purpose search engines. However, these search engines cannot find code semantically unless it has an accompanying description. We propose a technique for semantic code search: a Convolutional Neural Network approach to code retrieval (CoNCRA). Our technique aims to find the code snippet that most closely matches the developer's intent, expressed in natural language. We evaluated our approach's efficacy on a dataset composed of questions and code snippets collected from Stack Overflow. Our preliminary results showed that our technique, which prioritizes local interactions (nearby words), improved on the state of the art (SOTA) by 5% on average, retrieving the most relevant code snippet in the top 3 positions almost 80% of the time. Therefore, our technique is promising and can improve the efficacy of semantic code retrieval.
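
To make the retrieval metrics concrete, the sketch below computes the mean reciprocal rank (MRR) and a top-k rate from the position at which the annotated snippet is ranked for each question. The helper names and the example scores and ranks are hypothetical and are not results from the paper.

```python
def rank_of_annotated(scores, annotated_index):
    """1-based rank of the annotated snippet when candidates are sorted by score, highest first."""
    target = scores[annotated_index]
    return 1 + sum(1 for s in scores if s > target)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all questions."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def top_k_rate(ranks, k):
    """Fraction of questions whose annotated snippet is ranked within the first k positions."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical similarity scores of four candidate snippets for one question;
# the annotated snippet is the candidate at index 2.
example_rank = rank_of_annotated([0.31, 0.87, 0.64, 0.12], annotated_index=2)

# Hypothetical ranks of the annotated snippet for five questions.
ranks = [1, 2, 4, 3, 1]
mrr = mean_reciprocal_rank(ranks)
top1 = top_k_rate(ranks, 1)
top3 = top_k_rate(ranks, 3)
print(example_rank, mrr, top1, top3)
```

In this toy run, four of the five questions have the annotated snippet in the first three positions, which is the kind of quantity the top-3 figure in the abstract reports.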

Paper Structure

This paper contains 12 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of the joint embedding technique for code retrieval. Two neural networks map a question and a code snippet into a common vector space. The distance between the vectors reflects the relevance of a code snippet to a question. Adapted from cambronero-deep-code-search-2019.
  • Figure 2: 2D projection of the continuous vectors of the 66 most frequent words from a Python corpus $V$. The illustration was generated with t-SNE, which allows 2D visualization of high-dimensional data. We applied word2vec with skip-gram and a window size of $5$.
  • Figure 3: Schematic drawing of our Convolutional Neural Network (CNN) operations. Our example shows 4 filters $\bm{F} \in \mathbb{R}^{m \times d \times f}$ with window size $m = 2$. We slide each filter across the height of the input $\bm{x} \in \mathbb{R}^{n \times d}$, where $n = 6$ and $d = 5$, and compute the dot product between the entries of the filter and the input (karpathy-course-cnn-2016). Each filter returns a feature map $\bm{c}$. Finally, a max pooling layer reduces every feature map spatially, yielding the final vector $\bm{o}$. We use the vector $\bm{o}$ as our question and code snippet embedding. Adapted from zhang-guide-convolutional-cnn-embedding-ilustration:2015.
  • Figure 4: Examples of questions and the answers that our model (CoNCRA) gave. Our model answered the first question (Q1) correctly, selecting an answer (A 1.1) based on the Numpy library, which adds support for large, multi-dimensional arrays in Python. The second answer (A 1.2) for the first question returns the last element of an array, which is incorrect, although it seems interesting. The third one (A 1.3) shows the array's dimension using the Numpy library, so our model found a correlation between Numpy and array operations. In the other example (Q2), the first answer is incorrect, as it checks for the presence of elements in a matrix, not an array. Again, our model returned a Numpy answer. The second one (A 2.2) is the correct answer, and the third one (A 2.3) checks how many times a sequence repeats in a data frame.
  • Figure 5: Histogram of the first positions observed for the annotated code snippet during the final evaluation. The labels shared-cnn-with-bn, unif, and embedding refer to lines F3, A1, and B1, respectively, in Table \ref{table:resultados}.
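
The operations described in the Figure 3 caption can be sketched in plain Python as follows, using the same sizes as the caption ($n = 6$, $d = 5$, $m = 2$, and 4 filters). This is a minimal sketch under stated assumptions: the function names and the random inputs are hypothetical, and bias terms and any non-linearity are omitted because the caption does not mention them.

```python
import random


def conv_feature_map(x, filt):
    """Slide one m x d filter down the n x d input and return a feature map of length n - m + 1."""
    n, m, d = len(x), len(filt), len(filt[0])
    return [
        sum(filt[i][j] * x[start + i][j] for i in range(m) for j in range(d))
        for start in range(n - m + 1)
    ]


def encode(x, filters):
    """Max-pool each filter's feature map, giving one value per filter (the vector o)."""
    return [max(conv_feature_map(x, filt)) for filt in filters]


# Sizes from the Figure 3 caption: n = 6 tokens, d = 5 embedding dimensions,
# window size m = 2, and 4 filters. The values themselves are random placeholders.
rng = random.Random(0)
n, d, m, num_filters = 6, 5, 2, 4
x = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
filters = [
    [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(m)]
    for _ in range(num_filters)
]

feature_map = conv_feature_map(x, filters[0])
embedding = encode(x, filters)
print(len(feature_map), len(embedding))
```

Because max pooling keeps a single value per filter, the embedding length equals the number of filters regardless of how many tokens the question or code snippet contains, so inputs of different lengths map into the same vector space.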