Scalable Machine Learning Training Infrastructure for Online Ads Recommendation and Auction Scoring Modeling at Google

George Kurian, Somayeh Sardashti, Ryan Sims, Felix Berger, Gary Holt, Yang Li, Jeremiah Willcock, Kaiyuan Wang, Herve Quiroz, Abdulrahman Salem, Julian Grady

TL;DR

This work targets scalable, end-to-end training of Ads recommendation and auction scoring models at Google, addressing input generation, embedding management, and resource efficiency in a TPU-centric production environment. It introduces Shared Input Generation (SIG) to amortize feature transformations, advanced embedding partitioning and RPC coalescing to optimize TPU/CPU embedding workflows, and robust preemption and training-hold mechanisms to minimize wasted resources in shared datacenters. Empirical results from production-scale deployments show a 116% performance improvement and an 18% reduction in training costs across representative models, with significant gains in embedding throughput and input pipeline efficiency. The paper also discusses limitations and future directions, including adaptive memoization strategies, mutable data handling, and hybrid embedding/storage architectures to sustain continuous training at scale.

Abstract

Large-scale Ads recommendation and auction scoring models at Google scale demand immense computational resources. While specialized hardware like TPUs has improved linear algebra computations, bottlenecks persist in large-scale systems. This paper proposes solutions for three critical challenges that must be addressed for efficient end-to-end execution in a widely used production infrastructure: (1) Input Generation and Ingestion Pipeline: Efficiently transforming raw features (e.g., "search query") into numerical inputs and streaming them to TPUs; (2) Large Embedding Tables: Optimizing conversion of sparse features into dense floating-point vectors for neural network consumption; (3) Interruptions and Error Handling: Minimizing resource wastage in large-scale shared datacenters. To tackle these challenges, we propose a shared input generation technique that reduces the computational load of input generation by amortizing costs across many models. Furthermore, we propose partitioning, pipelining, and RPC (Remote Procedure Call) coalescing software techniques to optimize embedding operations. To maintain efficiency at scale, we describe novel preemption notice and training hold mechanisms that minimize resource wastage and ensure prompt error resolution. These techniques have demonstrated significant improvement in Google production, achieving a 116% performance boost and an 18% reduction in training costs across representative models.

Paper Structure

This paper contains 36 sections, 1 equation, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: High level components of the training service, with corresponding dataflow.
  • Figure 2: Example connected component from an Ads recommendation model’s input transformation graph.
  • Figure 3: System design and architecture of the shared input generation (SIG) pipeline.
  • Figure 4: Embedding table partitioning strategies for TPU. Three model parallelism strategies are used: (a) table partitioning, (b) column partitioning, and (c) row partitioning. Each color-coded block is an embedding table. The communication primitives used are AllToAll and AllGather. We employ a schematic similar to that of Mudigere et al. (2022) to illustrate dataflow.
  • Figure 5: (a) Serialized and (b) pipelined execution of TensorCore with SparseCore. Pipelined execution improves overlap and thereby performance.
  • ...and 2 more figures
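
The three partitioning strategies named in the Figure 4 caption can be illustrated with a toy sketch. This is not the paper's implementation: the table contents, `NUM_DEVICES`, and all function names below are hypothetical, devices are simulated with plain Python lists, and the AllToAll/AllGather communication steps are replaced by simple in-process reassembly. The sketch only shows where the data would live under each strategy and that each sharded lookup reproduces the unpartitioned result.

```python
# Toy illustration of table, column, and row partitioning of embedding tables.
# Assumptions (not from the paper): two simulated devices, tiny tables,
# and in-process reassembly standing in for collective communication.

NUM_DEVICES = 2  # hypothetical device count


def make_table(rows, dim, seed):
    """Build a deterministic rows x dim table of floats."""
    return [[float(seed * 100 + r * 10 + c) for c in range(dim)]
            for r in range(rows)]


def lookup(table, ids):
    """Reference lookup on an unpartitioned table."""
    return [table[i] for i in ids]


# (a) Table partitioning: each whole table is placed on one device.
def table_partition(tables):
    return {name: n % NUM_DEVICES for n, name in enumerate(sorted(tables))}


# (b) Column partitioning: each device holds a slice of the embedding dimension.
def column_shards(table):
    width = len(table[0]) // NUM_DEVICES  # assumes dim divides evenly
    return [[row[d * width:(d + 1) * width] for row in table]
            for d in range(NUM_DEVICES)]


def column_lookup(shards, ids):
    # Every device looks up all ids in its slice; slices are concatenated.
    parts = [[shard[i] for i in ids] for shard in shards]
    return [sum((parts[d][k] for d in range(len(shards))), [])
            for k in range(len(ids))]


# (c) Row partitioning: row r is stored on device r % NUM_DEVICES.
def row_shards(table):
    return [{r: row for r, row in enumerate(table) if r % NUM_DEVICES == d}
            for d in range(NUM_DEVICES)]


def row_lookup(shards, ids):
    # Each id is routed to the single device that owns its row.
    return [shards[i % NUM_DEVICES][i] for i in ids]


tables = {"query": make_table(4, 4, 1), "ad": make_table(6, 4, 2)}
ids = [3, 0, 3]
placement = table_partition(tables)
col_result = column_lookup(column_shards(tables["query"]), ids)
row_result = row_lookup(row_shards(tables["query"]), ids)
```

In this sketch, column partitioning requires every device to take part in every lookup, whereas row partitioning routes each id to exactly one device. Table partitioning needs no reassembly at all, but a single large table cannot be split across devices.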