MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage
Yongjun He, Roger Waleffe, Zhichao Han, Johnu George, Binhang Yuan, Zitao Zhang, Yinan Shan, Yang Zhao, Debojyoti Dutta, Theodoros Rekatsinas, Ce Zhang
TL;DR
The paper addresses two scalability bottlenecks that arise when large embedding models are trained against disk-based storage: data stalls and staleness. It introduces MLKV, a disk-based key-value storage framework built on FASTER that exposes a small interface (Open/Get/Put/Lookahead) and adds optimizations, including bounded staleness consistency and look-ahead prefetching, to decouple storage from computation. In experiments on open datasets and real-world eBay workloads, MLKV outperforms offloading strategies by 1.6-12.6x on large workloads and performs competitively with specialized in-memory frameworks on smaller tasks. The work also reports improved scalability, extensibility, and energy efficiency, and the code is open source.
Abstract
Many modern machine learning (ML) methods rely on embedding models to learn vector representations (embeddings) for a set of entities (embedding tables). As increasingly diverse ML applications utilize embedding models and embedding tables continue to grow in size and number, there has been a surge in the ad-hoc development of specialized frameworks targeted to train large embedding models for specific tasks. Although the scalability issues that arise in different embedding model training tasks are similar, each of these frameworks independently reinvents and customizes storage components for specific tasks, leading to substantial duplicated engineering efforts in both development and deployment. This paper presents MLKV, an efficient, extensible, and reusable data storage framework designed to address the scalability challenges in embedding model training, specifically data stall and staleness. MLKV augments disk-based key-value storage by democratizing optimizations that were previously exclusive to individual specialized frameworks and provides easy-to-use interfaces for embedding model training tasks. Extensive experiments on open-source workloads, as well as applications in eBay's payment transaction risk detection and seller payment risk detection, show that MLKV outperforms offloading strategies built on top of industrial-strength key-value stores by 1.6-12.6x. MLKV is open-source at https://github.com/llm-db/MLKV.
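To make the two challenges concrete, the following is a minimal in-memory sketch of how a Get/Put/Lookahead-style store could combine prefetching with a staleness bound. It is not MLKV's implementation or API: the class `ToyEmbeddingStore`, its parameters, the `pending` counter, and the rule that a read is refused once more than `staleness_bound` earlier reads of the same key are still awaiting their update are all assumptions made for illustration. A Python dict stands in for the disk, a small LRU cache stands in for memory, and Open is omitted.

```python
from collections import OrderedDict


class StalenessBoundExceeded(Exception):
    """Raised when a read would see a value that is too far behind."""


class ToyEmbeddingStore:
    """Hypothetical toy store; names and semantics are illustrative only."""

    def __init__(self, dim, staleness_bound, cache_capacity):
        self.dim = dim
        self.staleness_bound = staleness_bound
        self.cache_capacity = cache_capacity
        self.disk = {}              # stand-in for slow, disk-resident data
        self.cache = OrderedDict()  # stand-in for the in-memory LRU buffer
        self.pending = {}           # key -> reads not yet matched by a put
        self.stalls = 0             # cache misses paid on the training path

    def _cache_insert(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)  # evict least recently used

    def _load(self, key, on_critical_path):
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        if on_critical_path:
            self.stalls += 1  # the trainer would wait for this disk read
        value = self.disk.get(key, [0.0] * self.dim)
        self._cache_insert(key, value)
        return value

    def lookahead(self, keys):
        """Prefetch keys of a future batch so later gets hit the cache."""
        for key in keys:
            self._load(key, on_critical_path=False)

    def get(self, key):
        """Read an embedding unless too many updates to it are in flight."""
        if self.pending.get(key, 0) > self.staleness_bound:
            raise StalenessBoundExceeded(key)
        value = self._load(key, on_critical_path=True)
        self.pending[key] = self.pending.get(key, 0) + 1
        return list(value)

    def put(self, key, value):
        """Write an updated embedding through to the cache and the 'disk'."""
        self.disk[key] = list(value)
        self._cache_insert(key, list(value))
        self.pending[key] = max(0, self.pending.get(key, 0) - 1)


store = ToyEmbeddingStore(dim=2, staleness_bound=1, cache_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, [1.0, 1.0])   # "a" is evicted from the cache by "c"
store.lookahead(["a"])            # prefetched off the critical path
vec_a = store.get("a")            # cache hit, no stall
vec_b = store.get("b")            # cache miss, counted as a stall
```

In this sketch a larger `staleness_bound` lets more reads proceed before their matching updates arrive, which trades freshness for throughput, while `lookahead` moves disk reads off the path the trainer waits on. A real system would block or retry rather than raise, and would prefetch asynchronously.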
