Accelerating String-Key Learned Index Structures via Memoization-based Incremental Training

Minsu Kim, Jinwoo Hwang, Guseul Heo, Seiyeon Cho, Divya Mahajan, Jongse Park

TL;DR

This paper tackles the retraining bottleneck in updatable string-key learned indexes, where QR-based training over large, variable-length key sets degrades throughput. It introduces SIA, an algorithm-hardware co-design that uses memoization to enable incremental index learning and offloads training to an FPGA, freeing CPU resources for inference. The approach yields 2.6×–3.4× throughput improvements on real workloads (YCSB and Twitter traces) with modest memory overhead (~6%) and favorable power characteristics. By demonstrating a plug-and-play runtime interface and an FPGA-accelerated training pipeline, SIA offers a practical path to scalable, high-throughput, low-latency learned indexes in dynamic, update-heavy environments.

Abstract

Learned indexes use machine learning models to learn the mappings between keys and their corresponding positions in key-value indexes. These indexes use the mapping information as training data. Learned indexes require frequent retraining of their models to incorporate the changes introduced by update queries. To retrain the models efficiently, existing learned index systems often harness a linear algebraic QR factorization technique that performs matrix decomposition. This factorization approach processes all key-position pairs during each retraining, resulting in compute operations that grow linearly with the total number of keys and their lengths. Consequently, retraining creates a severe performance bottleneck, especially for variable-length string keys, even though it is crucial for maintaining high prediction accuracy and, in turn, ensuring low query service latency. To address this performance problem, we develop an algorithm-hardware co-designed string-key learned index system, dubbed SIA. In designing SIA, we leverage a unique algorithmic property of the matrix decomposition-based training method. Exploiting this property, we develop a memoization-based incremental training scheme, which only requires computation over updated keys, while decomposition results of non-updated keys from previous computations can be reused. We further enhance SIA to offload a portion of this training process to an FPGA accelerator, not only to relieve CPU resources for serving index queries (i.e., inference), but also to accelerate the training itself. Our evaluation shows that, compared to ALEX, LIPP, and SIndex, which are state-of-the-art learned index systems, SIA-accelerated learned indexes offer 2.6× and 3.4× higher throughput on two real-world benchmark suites, YCSB and Twitter cache trace, respectively.
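
The reuse property the abstract refers to can be illustrated with a standard linear algebra fact: the triangular factor of a QR decomposition over all rows can also be obtained by factorizing the previous triangular factor stacked on top of only the new rows. The following is a minimal sketch of that general idea, not the paper's actual algorithm. The helper names (`encode`, `qr_memo`, `weights`), the byte-per-column featurization, and the random fixed-length keys are all assumptions made for illustration. The sketch also assumes the targets of previously seen rows stay unchanged, which sidesteps how a real index handles position shifts after inserts.

```python
import numpy as np

def encode(keys):
    """Hypothetical featurization: one scaled byte per column plus a bias column."""
    feats = np.array([np.frombuffer(k, dtype=np.uint8) for k in keys], dtype=np.float64)
    return np.hstack([feats / 255.0, np.ones((len(keys), 1))])

def qr_memo(X, y, memo=None):
    """Return the triangular factor of [X | y], reusing a memoized factor if given.

    With memo=None every row is factorized (full retraining).
    With a memo, only the new rows are stacked under the small memoized factor,
    so the work depends on the number of new rows, not on all rows seen so far.
    """
    rows = np.hstack([X, y[:, None]])
    if memo is not None:
        rows = np.vstack([memo, rows])
    return np.linalg.qr(rows, mode="r")

def weights(memo):
    """Solve R w = z, where the factor of [X | y] has the block form [[R, z], [0, rho]]."""
    d = memo.shape[1] - 1
    return np.linalg.solve(memo[:d, :d], memo[:d, d])

# Illustrative data: 300 random 8-byte lowercase keys, sorted, with rank as position.
rng = np.random.default_rng(0)
raw = rng.integers(97, 123, size=(300, 8), dtype=np.uint8)
keys = sorted(r.tobytes() for r in raw)
X = encode(keys)
y = np.arange(len(keys), dtype=np.float64)

# Split rows into an "already trained" part and an "updated" part.
perm = rng.permutation(len(keys))
old, new = perm[:250], perm[250:]

# Full retraining over every row.
w_full = weights(qr_memo(X, y))

# Incremental training: factorize the old rows once, then fold in only the new rows.
memo = qr_memo(X[old], y[old])
memo = qr_memo(X[new], y[new], memo=memo)
w_inc = weights(memo)

print(np.max(np.abs(w_full - w_inc)))
```

In this sketch the memoized factor is a small square matrix whose size depends only on the number of feature columns, so the second call touches 50 new rows plus that factor instead of all 300 rows. The two weight vectors agree up to floating-point error because both factors satisfy the same normal equations.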

Paper Structure (31 sections, 18 figures, 2 tables, 2 algorithms)

Figures (18)

  • Figure 1: Retraining time increases as the size of a learned index system grows in response to a stream of update queries. Markers on the same line represent sequential retraining runs, where markers positioned to the left precede those on the right. We use four key lengths: 16, 32, 64, and 96.
  • Figure 2: (a) Read-only learned index in a hierarchical structure, (b) updatable learned index, and (c) SIA: the proposed updatable string-key learned index that leverages computation reuse and hardware acceleration to improve the system throughput.
  • Figure 3: Throughput results of two conventional indexes and three learned indexes for YCSB workloads.
  • Figure 4: Throughput as training time varies from 5 to 300 seconds. Training does not utilize any CPU cycles. The insertion ratio sweeps from 0% (read-only) to 50%.
  • Figure 5: Throughput with a varying number of threads for (a) inference, with 1 thread for training, and (b) training, with 1 thread for inference.
  • ...and 13 more figures