MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage

Yongjun He, Roger Waleffe, Zhichao Han, Johnu George, Binhang Yuan, Zitao Zhang, Yinan Shan, Yang Zhao, Debojyoti Dutta, Theodoros Rekatsinas, Ce Zhang

TL;DR

The paper addresses two scalability bottlenecks in training large embedding models on disk-based storage: data stalls and staleness. It introduces MLKV, a disk-based key-value storage framework built on FASTER that exposes simple interfaces (Open/Get/Put/Lookahead) and adds optimizations, including bounded staleness consistency and look-ahead prefetching, to decouple storage from computation. In experiments on open datasets and real-world eBay workloads, MLKV outperforms offloading strategies by 1.6-12.6x on large workloads and is competitive with specialized in-memory frameworks on smaller tasks. The work reports improved scalability, extensibility, and energy efficiency, and the code is open source.

Abstract

Many modern machine learning (ML) methods rely on embedding models to learn vector representations (embeddings) for a set of entities (embedding tables). As increasingly diverse ML applications utilize embedding models and embedding tables continue to grow in size and number, there has been a surge in the ad-hoc development of specialized frameworks targeted to train large embedding models for specific tasks. Although the scalability issues that arise in different embedding model training tasks are similar, each of these frameworks independently reinvents and customizes storage components for specific tasks, leading to substantial duplicated engineering efforts in both development and deployment. This paper presents MLKV, an efficient, extensible, and reusable data storage framework designed to address the scalability challenges in embedding model training, specifically data stall and staleness. MLKV augments disk-based key-value storage by democratizing optimizations that were previously exclusive to individual specialized frameworks and provides easy-to-use interfaces for embedding model training tasks. Extensive experiments on open-source workloads, as well as applications in eBay's payment transaction risk detection and seller payment risk detection, show that MLKV outperforms offloading strategies built on top of industrial-strength key-value stores by 1.6-12.6x. MLKV is open-source at https://github.com/llm-db/MLKV.
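
To make the Open/Get/Put/Lookahead interface and the two optimizations named in the TL;DR more concrete, the following is a minimal single-process sketch in Python. It is not MLKV's actual API or implementation: the class ToyEmbeddingStore, its method names, and the counters are hypothetical, the "disk" is simulated with a dictionary, and the bounded-staleness rule used here (a key may be read only while at most staleness_bound earlier reads of it are still waiting for their update) is an assumed simplification rather than the paper's protocol.

```python
from typing import Dict, Iterable, List


class ToyEmbeddingStore:
    """Hypothetical stand-in for a disk-backed embedding store (not MLKV's real API)."""

    def __init__(self, dim: int, staleness_bound: int) -> None:
        # Plays the role of "Open": fixes the embedding width and the staleness bound.
        self.dim = dim
        self.staleness_bound = staleness_bound
        self.disk: Dict[int, List[float]] = {}    # simulated slow tier
        self.cache: Dict[int, List[float]] = {}   # simulated in-memory tier
        self.pending: Dict[int, int] = {}         # reads not yet matched by a put
        self.disk_reads = 0                       # counts simulated disk accesses

    def _load(self, key: int) -> List[float]:
        # Bring a record into the cache, initializing unseen keys to zeros.
        if key not in self.cache:
            self.disk_reads += 1
            self.cache[key] = list(self.disk.get(key, [0.0] * self.dim))
        return self.cache[key]

    def lookahead(self, keys: Iterable[int]) -> None:
        # Prefetch keys of upcoming batches so a later get() hits the cache.
        for key in keys:
            self._load(key)

    def can_get(self, key: int) -> bool:
        # Assumed rule: a read is allowed only if the value it would see
        # is missing at most `staleness_bound` outstanding updates.
        return self.pending.get(key, 0) <= self.staleness_bound

    def get(self, key: int) -> List[float]:
        if not self.can_get(key):
            # A real system would block here; the sketch raises instead.
            raise RuntimeError(f"key {key} exceeds the staleness bound")
        self.pending[key] = self.pending.get(key, 0) + 1
        return list(self._load(key))

    def put(self, key: int, value: List[float]) -> None:
        # Write the updated embedding through to both tiers and retire one read.
        self.cache[key] = list(value)
        self.disk[key] = list(value)
        self.pending[key] = max(0, self.pending.get(key, 0) - 1)


store = ToyEmbeddingStore(dim=2, staleness_bound=1)
store.lookahead([1, 2])           # two simulated disk reads happen here
first = store.get(1)              # served from the cache, no extra disk read
second = store.get(1)             # allowed: one update is outstanding
blocked = not store.can_get(1)    # a third read would exceed the bound
store.put(1, [0.5, 0.5])          # applying an update frees one slot
```

With staleness_bound set to 0 the sketch degenerates to fully synchronous access (every read must be followed by its update before the key can be read again), while a very large bound approximates fully asynchronous access; the bound is the knob between those two extremes.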

Paper Structure

This paper contains 19 sections, 3 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Embedding table and neural network paradigm.
  • Figure 2: Scalability issues in embedding model training: (left and middle) poor throughput in synchronous training due to data stalls; (right) degraded model quality in fully asynchronous training due to staleness. We train DLRMs on the Criteo dataset using PERSIA as the computation layer and FASTER as the storage layer.
  • Figure 3: Example usage of MLKV.
  • Figure 4: Embedding model training with MLKV.
  • Figure 5: Key designs of MLKV.
  • ...and 6 more figures