Unlocking Insights: Semantic Search in Jupyter Notebooks
Lan Li, Jinpeng Lv
TL;DR
Addresses semantic search in Jupyter Notebooks, where mixed text and code content challenges traditional keyword approaches. Proposes an end-to-end framework that preprocesses notebook content, derives function-level embeddings to overcome embedding token limits, and uses GPT-4-32k for code summarization when needed, with a Weaviate vector store for retrieval. Implements three query types (Exact, User-Defined, Code Summary) and demonstrates that GPT-generated code summaries closely align with the underlying content via cosine-distance analysis. The work lays groundwork for robust, notebook-aware semantic search with future multimodal and multi-language extensions.
Abstract
Semantic search aims to deliver highly relevant results by understanding the searcher's intent and the contextual meaning of terms within a searchable dataspace, and it plays a pivotal role in information retrieval. In this paper, we investigate the application of large language models to enhance semantic search capabilities for Jupyter Notebooks. Our objective is to retrieve generated outputs such as figures or tables, their associated functions and methods, and other pertinent information. We demonstrate a semantic search framework that builds a semantic understanding of the entire notebook's contents, enabling it to handle various types of user queries. Key components of this framework include: (1) a data preprocessor that handles the different cell types in Jupyter Notebooks, covering both markdown and code cells; and (2) a method for addressing the token size limitations that arise with code cells, which moves the unit of data input from the cell level to the function level and thereby resolves these limitations.
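To make the two components above concrete, the following is a minimal sketch of cell preprocessing and function-level splitting. It is not the authors' implementation: the helper names `iter_cells` and `split_code_cell` are hypothetical, the notebook is assumed to be nbformat-4 JSON, and Python's standard `ast` module is assumed as the way to locate function boundaries. Embedding the chunks and storing them in a vector store are left out.

```python
import ast
import json


def iter_cells(notebook_json: str):
    """Yield (cell_type, source) for the markdown and code cells of a notebook.

    Assumes nbformat-4 JSON, where `source` is a string or a list of lines.
    """
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") in ("markdown", "code"):
            src = cell.get("source", "")
            yield cell["cell_type"], "".join(src) if isinstance(src, list) else src


def split_code_cell(source: str):
    """Split one code cell into function-level chunks (hypothetical helper).

    Each top-level function becomes its own chunk; the remaining top-level
    statements are grouped into a single chunk appended at the end.
    Decorator lines are not included in a function's extracted text.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Cells containing IPython magics (e.g. %matplotlib inline) are not
        # valid Python, so this sketch keeps such a cell as one chunk.
        return [{"kind": "cell", "name": None, "text": source}]

    chunks, rest = [], []
    for node in tree.body:
        segment = ast.get_source_segment(source, node)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({"kind": "function", "name": node.name, "text": segment})
        else:
            rest.append(segment)
    if rest:
        chunks.append({"kind": "top_level", "name": None, "text": "\n".join(rest)})
    return chunks


if __name__ == "__main__":
    # Toy notebook used only for illustration.
    demo = json.dumps({
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "source": ["# Circle areas\n", "Compute areas."]},
            {"cell_type": "code",
             "source": "import math\n\ndef area(r):\n    return math.pi * r ** 2\n\nprint(area(2))\n"},
        ],
    })
    for cell_type, text in iter_cells(demo):
        units = split_code_cell(text) if cell_type == "code" else [{"kind": "markdown", "name": None, "text": text}]
        for unit in units:
            print(unit["kind"], unit["name"], len(unit["text"]))
```

Each chunk is small enough to be embedded on its own, which is the point of moving from the cell level to the function level; a function that still exceeds the embedding model's token limit would need further handling, such as summarization.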
