CAMAL: Optimizing LSM-trees via Active Learning

Weiping Yu, Siqiang Luo, Zihao Yu, Gao Cong

TL;DR

Camal tackles the problem of tuning LSM-tree parameters for variable read/write workloads by fusing a complexity-based cost model with active learning, enabling decoupled parameter exploration, data-growth extrapolation, and online adaptation. The approach introduces a decoupled active-learning framework, an extrapolation strategy that avoids retraining, and a dynamic mode (with a lazy-transition LSM-tree) that handles workload shifts, all integrated into RocksDB. Key contributions include the first application of active learning to LSM-tree instance optimization, a hierarchical, decoupled sampling scheme, efficient extrapolation across data growth, and a dynamic tuning mechanism. Empirical results show a system performance improvement of 28% on average and up to 8x over a state-of-the-art RocksDB design, along with training-time savings. The work delivers near-optimal configurations with fewer samples, enabling responsive tuning for real-world storage systems under diverse and evolving workloads.

Abstract

We use machine learning to optimize LSM-tree structure, aiming to reduce the cost of processing various read/write operations. We introduce a new approach, Camal, which has the following features: (1) ML-Aided: Camal is the first attempt to apply active learning to tune LSM-tree-based key-value stores. The learning process is coupled with traditional cost models to improve training; (2) Decoupled Active Learning: backed by rigorous analysis, Camal adopts an active learning paradigm based on decoupled tuning of each parameter, which further accelerates the learning process; (3) Easy Extrapolation: Camal adopts an effective mechanism to incrementally update the model as the data size grows; (4) Dynamic Mode: Camal is able to tune the LSM-tree online under dynamically changing workloads; (5) Significant System Improvement: when Camal is integrated into a full system, RocksDB, system performance improves by 28% on average and up to 8x compared to a state-of-the-art RocksDB design.

Paper Structure

This paper contains 14 sections, 2 theorems, 6 equations, 7 figures, 2 tables, and 2 algorithms.

Key Result

Lemma 1

Given the prevalent cost model [dayan2018dostoevsky, dayan2017monkey] for leveling, the process of configuration optimization can be decoupled into two distinct stages: first, determining the optimal value of $T^*$, and second, allocating memory between $M_b$ and $M_f$. This decoupling ensures the a
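
The two-stage structure in Lemma 1 can be sketched as a small search routine. This is a minimal illustration under stated assumptions, not the paper's procedure: `toy_cost` is a hypothetical, simplified leveling cost (its constants, the default 50/50 memory split, and all parameter names are invented for the example), and the search simply sweeps candidate values. The lemma's guarantee refers to the cost model cited in the source, so on this toy function the decoupled result need not equal the joint optimum.

```python
import math

def toy_cost(T, m_buf, m_filter, n=1e7, entry_bits=1024, page_entries=32,
             read_frac=0.5, write_frac=0.5):
    """Hypothetical leveling cost in expected I/Os per operation (illustrative only)."""
    # Fewer levels are needed as the size ratio T or the write buffer grows.
    levels = max(1, math.ceil(math.log(n * entry_bits / m_buf, T)))
    # Bloom-filter false-positive rate for m_filter / n bits per key.
    fpr = math.exp(-(m_filter / n) * math.log(2) ** 2)
    read_cost = 1.0 + fpr * (levels - 1)    # one useful probe plus wasted probes
    write_cost = levels * T / page_entries  # amortized merge cost under leveling
    return read_frac * read_cost + write_frac * write_cost

def decoupled_search(total_mem, size_ratios, buf_fractions, default_frac=0.5):
    """Stage 1 picks T* at a fixed memory split; stage 2 splits memory at T*."""
    evals = 0
    best_T, best_cost = None, float("inf")
    # Stage 1: sweep the size ratio T with the memory split held fixed.
    for T in size_ratios:
        cost = toy_cost(T, default_frac * total_mem, (1 - default_frac) * total_mem)
        evals += 1
        if cost < best_cost:
            best_T, best_cost = T, cost
    # Stage 2: with T* fixed, sweep the share of memory given to the buffer (M_b);
    # the remainder goes to the Bloom filters (M_f).
    best_frac = default_frac
    for frac in buf_fractions:
        cost = toy_cost(best_T, frac * total_mem, (1 - frac) * total_mem)
        evals += 1
        if cost < best_cost:
            best_frac, best_cost = frac, cost
    return best_T, best_frac, best_cost, evals

if __name__ == "__main__":
    ratios = [2, 4, 6, 8, 10]
    fractions = [0.1, 0.3, 0.5, 0.7, 0.9]
    T_star, frac, cost, evals = decoupled_search(1.6e8, ratios, fractions)
    print(f"T*={T_star}, buffer share={frac}, cost={cost:.3f}, evaluations={evals}")
```

The point of the sketch is the evaluation count: the decoupled sweep costs `len(size_ratios) + len(buf_fractions)` evaluations, whereas a joint grid over the same candidates costs their product.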

Figures (7)

  • Figure 1: Illustration of a plain ML approach (e.g., polynomial regression), a plain active learning approach, and our Camal with respect to the trade-off between training samples and system performance.
  • Figure 2: Parameters of LSM-trees and workloads used throughout the paper, complexity-based I/O cost models, and base sampling space of instance-optimized LSM-trees.
  • Figure 3: Overview of Camal. In its training phase, Camal first decouples the sampling space and identifies the theoretical optimum using a complexity-based model. It then integrates an ML model into an active learning cycle, in which the model is iteratively refined and used to select subsequent samples. To address data growth, we introduce an extrapolation strategy that extends the optima found in training to larger data sizes without retraining. Both methods are designed to reduce training costs. To meet practical demands, we also apply the extrapolation strategy to enhance Camal for dynamic environments, which includes equipping LSM-trees with the ability to adapt dynamically to changing parameters.
  • Figure 4: Running example of the dynamic system mode. We employ a lazy transition strategy to keep the transition cost low.
  • Figure 5: Classic methods that rely solely on theoretical cost models need no training samples but achieve limited optimization. Conversely, other ML or AL methods that employ plain sampling strategies achieve high performance but incur relatively high sampling costs. Camal is the only complexity-analysis-driven, ML-aided framework that achieves high performance while significantly reducing sampling costs.
  • ...and 2 more figures
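
The training loop described in the Figure 3 caption (start from the theoretical optimum, then let a model choose each subsequent sample) can be sketched as below. Everything specific here is an assumption made for illustration: `simulated_latency` is a hypothetical stand-in for benchmarking a configuration on a real system, the surrogate is a plain quadratic least-squares fit rather than whatever model Camal uses, and the selection rule (sample the unmeasured candidate with the lowest predicted cost) is a simple greedy choice, not the paper's strategy.

```python
def simulated_latency(T):
    """Hypothetical stand-in for measuring one configuration on a real system."""
    return 0.4 * T + 60.0 / T

def fit_quadratic(xs, ys):
    """Least-squares fit y ~ a*x^2 + b*x + c via the 3x3 normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix for the unknowns in the order (c, b, a).
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [p - factor * q for p, q in zip(m[r], m[col])]
    c, b, a = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c

def active_learning(candidates, theoretical_opt, budget, measure=simulated_latency):
    """Seed near the cost-model optimum, then let the surrogate pick each next sample."""
    i = candidates.index(theoretical_opt)
    lo = min(max(i - 1, 0), len(candidates) - 3)
    # Seed with the theoretical optimum and its neighbours (three distinct points).
    measured = {T: measure(T) for T in candidates[lo:lo + 3]}
    while len(measured) < budget:
        pool = [T for T in candidates if T not in measured]
        if not pool:
            break
        a, b, c = fit_quadratic(list(measured), list(measured.values()))
        # Greedy selection: measure where the surrogate predicts the lowest cost.
        nxt = min(pool, key=lambda T: a * T * T + b * T + c)
        measured[nxt] = measure(nxt)
    best = min(measured, key=measured.get)
    return best, measured

if __name__ == "__main__":
    best, measured = active_learning(list(range(2, 21)), theoretical_opt=8, budget=6)
    print("best size ratio among measured samples:", best)
    print("samples taken:", sorted(measured))
```

Seeding at the cost-model optimum is what keeps the sample budget small in this sketch: the loop only has to correct the model's error locally instead of exploring the whole candidate range.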

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2