Typhon: Automatic Recommendation of Relevant Code Cells in Jupyter Notebooks
Chaiyong Ragkhitwetsagul, Veerakit Prasertpol, Natanon Ritta, Paphon Sae-Wong, Thanapon Noraset, Morakot Choetkiertikul
TL;DR
This paper presents Typhon, a retrieval-based system that automatically recommends relevant code cells in Jupyter notebooks by matching descriptive markdown text against a large corpus of markdown-code cell pairs. It compares BM25-based text retrieval with embedding-based UniXcoder retrieval, using the Kaggle-derived KGTorrent corpus and a vector search backend to fetch candidate code cells. Results show moderate accuracy for Matplotlib-related code, with UniXcoder often outperforming BM25 at higher ranks and BM25 improving with stemming/lemmatization. The work demonstrates the feasibility of code-cell reuse in notebooks and outlines directions for future work, including comparisons with generative tools and broader embedding techniques.
Abstract
Code recommendation tools have become increasingly important to software developers across many areas of expertise. They improve productivity in software development and make it easier for developers to find code examples and learn from them. This paper proposes Typhon, an approach that automatically recommends relevant code cells in Jupyter notebooks. Typhon tokenizes a developer's markdown description cell and searches a database for the most similar code cells, using text-similarity techniques such as the BM25 ranking function or CodeBERT, a machine-learning approach. Specifically, it computes the similarity between the tokenized query and the markdown cells in the database, then returns the code cells associated with the best-matching markdown cells. We evaluated Typhon on Jupyter notebooks from Kaggle competitions and found that it can recommend code cells with moderate accuracy. The approach and results in this paper can inform further improvements to code cell recommendation in Jupyter notebooks.
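The BM25 variant of the retrieval step described above can be illustrated with a minimal sketch. It is not Typhon's implementation: the `tokenize` helper, the `BM25Index` class, the parameter values `k1=1.5` and `b=0.75`, and the three-pair toy corpus are all assumptions made for illustration, and the sketch omits the stemming/lemmatization and embedding-based options.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """Hypothetical index over (markdown description, code cell) pairs."""

    def __init__(self, pairs, k1=1.5, b=0.75):
        self.pairs = pairs
        self.k1, self.b = k1, b  # common BM25 defaults, assumed here
        # Term frequencies and lengths of each markdown cell.
        self.docs = [Counter(tokenize(markdown)) for markdown, _ in pairs]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency of each term, then a smoothed, always-positive IDF.
        n = len(self.docs)
        df = Counter(term for doc in self.docs for term in doc)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query_tokens, i):
        """BM25 score of markdown cell i for the tokenized query."""
        doc, length = self.docs[i], self.lengths[i]
        total = 0.0
        for term in query_tokens:
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def recommend(self, query, top_k=3):
        """Return code cells whose markdown cells best match the query."""
        tokens = tokenize(query)
        scored = [(self.score(tokens, i), i) for i in range(len(self.pairs))]
        scored.sort(key=lambda item: item[0], reverse=True)
        # Drop cells that share no term with the query.
        return [self.pairs[i][1] for s, i in scored[:top_k] if s > 0]


# Toy corpus (invented for this example, not taken from KGTorrent).
pairs = [
    ("Plot a histogram of passenger ages", "plt.hist(df['Age'])"),
    ("Load the training data from CSV", "df = pd.read_csv('train.csv')"),
    ("Draw a scatter plot of fare versus age", "plt.scatter(df['Age'], df['Fare'])"),
]
index = BM25Index(pairs)
results = index.recommend("histogram of ages", top_k=3)
print(results)
```

In this toy run the histogram cell ranks first because its description shares three terms with the query, while the CSV-loading cell is dropped because it shares none. Note that the exact-match tokenizer treats "age" and "ages" as different terms, which is the kind of mismatch that stemming or lemmatization is meant to reduce.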
