Trigram-Based Persistent IDE Indices with Quick Startup

Zakhar Iakovlev, Alexey Chulkov, Nikita Golikov, Vyacheslav Lukianov, Nikita Zinoviev, Dmitry Ivanov, Vitaly Aksenov

TL;DR

This work addresses the slow, version-dependent search bottleneck in large code repositories by proposing a persistent trigram index that enables zero-time startup and seamless history-aware search. The core idea combines an on-disk LMDB-based trigram index with a revision-tree of deltas to allow fast checkout and commit, avoiding rebuilds for each version, and extends the approach with CamelHump search for symbol-level navigation. Key contributions include a detailed design of a delta-centric, persistent trigram data structure, efficient checkout/commit operations with complexities $O(\text{path size})$ and $O(\text{delta size})$, and a CamelHump extension that stores symbol occurrences per trigram and ranks results. Experimental evaluation on several open-source repositories demonstrates rapid index requests, scalable checkout times, and substantial memory efficiency, indicating strong practicality for cloud IDEs and remote services requiring near-instant startup and robust version-aware search.
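The revision-tree idea can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: a plain Python set stands in for the on-disk LMDB store, and the names `Revision`, `PersistentIndex`, `checkout` and `commit` are hypothetical. The sketch assumes that every delta is exact, that is, added pairs were absent in the parent revision and removed pairs were present in it.

```python
class Revision:
    """A node of the revision tree; stores the delta relative to its parent."""

    def __init__(self, parent=None, added=(), removed=()):
        self.parent = parent
        self.depth = 0 if parent is None else parent.depth + 1
        self.added = set(added)      # (trigram, file) pairs introduced here
        self.removed = set(removed)  # (trigram, file) pairs deleted here


class PersistentIndex:
    """Keeps one materialized index and moves it between revisions."""

    def __init__(self, root):
        self.current = root
        self.pairs = set(root.added)  # index contents at the current revision

    def commit(self, added=(), removed=()):
        """Create a child revision; the work is proportional to the delta."""
        rev = Revision(self.current, added, removed)
        self.pairs -= rev.removed
        self.pairs |= rev.added
        self.current = rev
        return rev

    def checkout(self, target):
        """Walk the tree path from the current revision to the target."""
        up, down = self.current, target
        to_apply = []
        # Lift the deeper endpoint until both meet at the common ancestor.
        while up is not down:
            if up.depth >= down.depth:
                # Undo the delta of a revision we are leaving.
                self.pairs -= up.added
                self.pairs |= up.removed
                up = up.parent
            else:
                # Remember a revision we still have to enter.
                to_apply.append(down)
                down = down.parent
        # Apply the deltas from the common ancestor down to the target.
        for rev in reversed(to_apply):
            self.pairs -= rev.removed
            self.pairs |= rev.added
        self.current = target

    def files_with(self, trigram):
        """Files containing the trigram at the current revision (linear scan)."""
        return {f for t, f in self.pairs if t == trigram}
```

In this sketch a checkout touches only the deltas on the tree path between the two revisions, and a commit touches only its own delta. The index is never rebuilt from the file contents.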

Abstract

A common way to speed up the find operation over a set of text files is a trigram index. This structure is simply a map from each trigram (a sequence of three characters) to the set of files that contain it. When searching for a pattern, candidate files are identified by intersecting the sets associated with the trigrams of the pattern, and the search then proceeds only in these files. In a code repository, however, the trigram index changes from version to version. Upon checking out a new version, the index is typically rebuilt from scratch, which is time-consuming, whereas we want the index to have almost zero-time startup. We therefore explore a persistent version of the trigram index for full-text and keyword pattern search. Our approach keeps the trigram index of the current version and, during checkout, applies only the changes between versions, which significantly improves performance. Furthermore, we extend the data structure to support CamelHump search for class and function names.
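The basic trigram lookup described in the abstract can be sketched as follows. This is an illustrative in-memory version only: the function names `trigrams`, `build_index` and `search` are hypothetical, and the files are assumed to be given as a dictionary from file name to content.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of file names that contain it."""
    index = defaultdict(set)
    for name, content in files.items():
        for tri in trigrams(content):
            index[tri].add(name)
    return index


def search(index, files, pattern):
    """Return the sorted names of the files that contain pattern."""
    tris = trigrams(pattern)
    if not tris:
        # A pattern shorter than 3 characters has no trigrams,
        # so the index cannot filter and every file is a candidate.
        candidates = set(files)
    else:
        # Only files containing every trigram of the pattern can match.
        candidates = set.intersection(*(index.get(t, set()) for t in tris))
    # A candidate may hold the trigrams in scattered places,
    # so confirm each one with a real substring scan.
    return sorted(name for name in candidates if pattern in files[name])
```

The final scan is needed because the intersection only narrows the set of files: a file can contain every trigram of the pattern without containing the pattern itself.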

Paper Structure (23 sections, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Checkout time as a function of the number of trigrams for the xodus repository. The dependence is the same for the other repositories. The checkout takes $2.15$ seconds for $10^6$ trigrams.