Embedding-based search in JetBrains IDEs

Evgeny Abramov, Nikolai Palchikov

TL;DR

The paper addresses semantic search within JetBrains IDEs by introducing an on-device embedding-based retrieval system for Search Everywhere, overcoming limitations of regex-based approaches and server-based solutions. It proposes a compact, single-model pipeline that precomputes fixed-size embeddings, uses cosine similarity for semantic matching, and streams results to preserve responsiveness, all while maintaining privacy by avoiding external data transmission. The authors compare a small embedding architecture against pretrained models, showing competitive ranking metrics (e.g., NDCG@10, MRR@10) and provide a detailed analysis of indexing latency, memory, and storage footprints; they also demonstrate benefits from indexing full function bodies. The work lays a pragmatic foundation for real-time, context-aware search in IDEs and outlines concrete future directions, including model optimization, longer-context representations, and online evaluation to guide production deployment. This has practical significance for improving developer productivity by enabling robust semantic discovery directly within developer environments without sacrificing performance or privacy.
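The retrieval step summarized above (precomputed fixed-size embeddings, cosine-similarity matching, results streamed as they are found) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the hand-made three-dimensional vectors, and the threshold value are all assumptions chosen for illustration, and the embedding model that would produce real vectors is left out entirely.

```python
import math
from typing import Dict, Iterator, List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def build_index(embeddings: Dict[str, Vector]) -> Dict[str, Vector]:
    """Normalize every item vector once, at indexing time (hypothetical helper)."""
    return {name: normalize(vec) for name, vec in embeddings.items()}


def search(
    index: Dict[str, Vector], query_vec: Vector, threshold: float
) -> Iterator[Tuple[str, float]]:
    """Lazily yield (item, score) pairs whose cosine similarity passes the threshold."""
    q = normalize(query_vec)
    for name, vec in index.items():
        score = sum(a * b for a, b in zip(q, vec))
        if score >= threshold:
            # Yielding here lets a caller display a hit before the scan finishes.
            yield name, score


# Toy vectors standing in for model output (assumed values, not real embeddings).
toy_index = build_index(
    {
        "Reformat Code": [0.9, 0.1, 0.0],
        "Commit Changes": [0.0, 0.2, 0.9],
        "Rearrange Code": [0.8, 0.3, 0.1],
    }
)

for item, score in search(toy_index, [1.0, 0.0, 0.0], threshold=0.8):
    print(f"{item}: {score:.3f}")
```

Writing the search as a generator mirrors the streaming idea: each match is handed to the caller immediately instead of waiting for the whole index to be scanned. The threshold trades recall for precision, which is the kind of trade-off Figure 4 compares across several similarity thresholds.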

Abstract

Most modern Integrated Development Environments (IDEs) and code editors have a feature to search across available functionality and items in an open project. In JetBrains IDEs, this feature is called Search Everywhere: it allows users to search for files, actions, classes, symbols, settings, and anything from VCS history from a single entry point. However, it works with candidates obtained by algorithms that don't account for semantics, e.g., synonyms, complex word permutations, part-of-speech modifications, and typos. In this work, we describe the machine learning approach we implemented to improve the discoverability of search items. We also share the obstacles encountered during this process and how we overcame them.

Paper Structure

This paper contains 8 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Search Everywhere dialog with enabled embedding-based search
  • Figure 2: Indexing process
  • Figure 3: Search process
  • Figure 4: Parallel coordinates plot to compare multiple metrics for several similarity thresholds