A Bag of Tricks for Scaling CPU-based Deep FFMs to more than 300m Predictions per Second

Blaž Škrlj, Benjamin Ben-Shalom, Grega Gašperšič, Adi Schwartz, Ramzi Hoseisi, Naama Ziporin, Davorin Kopič, Andraž Tori

TL;DR

This work addresses scalable, low-footprint CTR prediction on CPU by presenting FW, a Rust-based Deep FFM system with training and serving separated across data centers. It combines warm-up acceleration, Hogwild-based training, sparse gradient updates, context caching, SIMD-accelerated forward passes, and a patching plus 16-bit weight quantization scheme to reduce bandwidth. The results show practical CPU-only deployments serving more than 300 million predictions per second while cutting the bandwidth needed for model updates by more than an order of magnitude. The work also contributes an open-source implementation, enabling broader adoption and further research into CPU-only Deep FFMs.

Abstract

Field-aware Factorization Machines (FFMs) have emerged as a powerful model for click-through rate prediction, particularly excelling in capturing complex feature interactions. In this work, we present an in-depth analysis of our in-house, Rust-based Deep FFM implementation, and detail its deployment on a CPU-only, multi-data-center scale. We give an overview of key optimizations devised for both training and inference, demonstrated by previously unpublished benchmark results in efficient model search and online training. Further, we detail an in-house weight quantization that resulted in more than an order of magnitude reduction in bandwidth footprint related to weight transfers across data centers. We disclose the engine and associated techniques under an open-source license to contribute to the broader machine learning community. This paper showcases one of the first successful CPU-only deployments of Deep FFMs at such scale, marking a significant stride in practical, low-footprint click-through rate prediction methodologies.

Paper Structure

This paper contains 12 sections, 5 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Overview of the key topics discussed in this paper. Performance optimizations that span model search (AutoML), online model training, storage, transfer and serving are discussed.
  • Figure 2: Architecture of the implemented CPU-based Deep FFMs. The main blocks are the neural network (gray), logistic (yellow), and FFM (red) blocks.
  • Figure 3: Visualization of the overall performance of different algorithms (single-pass) across different benchmark data sets (top-down: Criteo, Avazu, kddcup2012). Visualizations show traces of all trained models (per engine).
  • Figure 4: Impact of context caching on inference time.
  • Figure 5: Relative impact of SIMD-enabled (blue, after drop) vs. SIMD-disabled (purple) FW in production (inference).
  • ...and 1 more figure