Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems

Benjamin Coleman, Wang-Cheng Kang, Matthew Fahrbach, Ruoxi Wang, Lichan Hong, Ed H. Chi, Derek Zhiyuan Cheng

TL;DR

The paper addresses the challenge of learning embeddings for high-cardinality categorical features in web-scale search, ads, and recommender (SAR) systems. It introduces Feature Multiplexing, a framework that shares a single embedding space across multiple features, and derives both theoretical insights (gradient decomposition and variance analysis) and empirical evidence of Pareto-optimal parameter-accuracy tradeoffs. Building on this, the authors propose Unified Embedding, a practical multiplexed approach that is deployed in industrial systems and yields significant offline and online gains across diverse domains. The work demonstrates that shared embeddings, when combined with careful training dynamics and hardware-friendly design, can dramatically simplify configuration and improve performance in large-scale production environments.

Abstract

Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems. A typical model ingests hundreds of features with vocabularies on the order of millions to billions of tokens. The standard approach is to represent each feature value as a d-dimensional embedding, introducing hundreds of billions of parameters for extremely high-cardinality features. This bottleneck has led to substantial progress in alternative embedding algorithms. Many of these methods, however, make the assumption that each feature uses an independent embedding table. This work introduces a simple yet highly effective framework, Feature Multiplexing, where one single representation space is used across many different categorical features. Our theoretical and empirical analysis reveals that multiplexed embeddings can be decomposed into components from each constituent feature, allowing models to distinguish between features. We show that multiplexed representations lead to Pareto-optimal parameter-accuracy tradeoffs for three public benchmark datasets. Further, we propose a highly practical approach called Unified Embedding with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware. Unified embedding gives significant improvements in offline and online metrics compared to highly competitive baselines across five web-scale search, ads, and recommender systems, where it serves billions of users across the world in industry-leading products.
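
To make the parameter bottleneck concrete, the following back-of-the-envelope sketch compares one full embedding table per feature against a single shared (multiplexed) table. The vocabulary sizes, the embedding dimension, and the shared table size are hypothetical values chosen for illustration; none of them come from the paper.

```python
def full_table_params(vocab_sizes, dim):
    """Parameters needed when every feature value gets its own d-dimensional row."""
    return sum(vocab * dim for vocab in vocab_sizes)


def shared_table_params(num_rows, dim):
    """Parameters in one shared table that all features hash into."""
    return num_rows * dim


# Hypothetical vocabularies for three categorical features (assumed, not from the paper).
vocab_sizes = [1_000_000_000, 50_000_000, 2_000_000]
dim = 64  # assumed embedding dimension

full = full_table_params(vocab_sizes, dim)
shared = shared_table_params(num_rows=10_000_000, dim=dim)  # assumed shared table size

print(f"independent full tables: {full:,} parameters")
print(f"one shared table:        {shared:,} parameters")
print(f"reduction factor:        {full / shared:.1f}x")
```

The reduction comes purely from sharing rows across values and features; whether accuracy is preserved at a given table size is the empirical question the paper studies.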

Paper Structure (46 sections, 3 theorems, 16 equations, 9 figures, 4 tables)

Key Result

Proposition 4.1

For any $\mathbf{x}_1, \mathbf{y}_1 \in \{0, 1\}^{N_1}$ and $\mathbf{x}_2, \mathbf{y}_2 \in \{0, 1\}^{N_2}$, let $\mathbf{x} = [\mathbf{x}_1, \mathbf{x}_2]$, $\mathbf{y} = [\mathbf{y}_1, \mathbf{y}_2]$ denote their concatenations. Let $\mu_U$, $\mu_H$, $\sigma^2_U$, and $\sigma^2_H$ be the mean and

Figures (9)

  • Figure 1: Embedding methods for two categorical features. We highlight the lookup process for the first value $v_1$ of each feature. Hash tables randomly share representations within each feature, while Unified Embedding shares representations across features. To implement Unified Embedding with different dimensions (multi-size or variable-length), we perform multiple lookups and concatenate the results.
  • Figure 2: Single-layer neural embedding model with per-feature weights $\boldsymbol{\theta}_t$ (left). Mean embedding $\ell^2$-norm (middle) and mean angle between all pairs of weight vectors $\boldsymbol{\theta}_{t_1},\boldsymbol{\theta}_{t_2}$ (right) as a function of table size for Criteo across all 26 categorical features. Note that the horizontal axes are in log scale.
  • Figure 3: Pareto frontier of multiplexed and original (non-multiplexed) methods. Top-left is better.
  • Figure 4: Illustration of the logistic regression model, annotated with notation that corresponds to our theoretical analysis.
  • Figure 5: Illustration of baseline embedding methods, all of which are compatible with feature multiplexing.
  • ...and 4 more figures
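
The lookup scheme described in the Figure 1 caption can be sketched roughly as follows: every feature hashes into one shared table, and a wider embedding is obtained by performing several lookups and concatenating the rows. This is a minimal pure-Python illustration, not the authors' implementation. The class name `UnifiedEmbedding`, the helper `hash_index`, the use of BLAKE2 with a feature-and-lookup-specific key, and the Gaussian initialization are all assumptions made for the example.

```python
import hashlib
import random


def hash_index(feature_id, value, lookup_seed, table_size):
    """Map (feature, value, lookup index) to a row of the shared table.

    Including feature_id in the key gives each feature its own hash
    function, so the same raw value in two features usually lands on
    different rows (an assumption of this sketch).
    """
    key = f"{lookup_seed}:{feature_id}:{value}".encode("utf-8")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % table_size


class UnifiedEmbedding:
    """One embedding table shared across all categorical features (sketch)."""

    def __init__(self, table_size, dim, seed=0):
        rng = random.Random(seed)
        self.table_size = table_size
        self.dim = dim
        # Small random initialization; a real system would train these rows.
        self.table = [
            [rng.gauss(0.0, 0.01) for _ in range(dim)]
            for _ in range(table_size)
        ]

    def lookup(self, feature_id, value, num_lookups=1):
        """Return a vector of length num_lookups * dim.

        Each lookup uses a different hash seed; the resulting rows are
        concatenated, which is how a feature can get a wider embedding
        from a fixed-width shared table.
        """
        out = []
        for k in range(num_lookups):
            row = hash_index(feature_id, value, k, self.table_size)
            out.extend(self.table[row])
        return out


emb = UnifiedEmbedding(table_size=1000, dim=4)
query_vec = emb.lookup("query_id", "v1", num_lookups=2)  # width 8
country_vec = emb.lookup("country", "v1")                # width 4
print(len(query_vec), len(country_vec))
```

Here a feature that needs more capacity simply requests more lookups, so per-feature configuration reduces to choosing `num_lookups` rather than sizing a separate table for each feature.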

Theorems & Definitions (5)

  • Definition 4.1
  • Proposition 4.1
  • Lemma A.1 (Weinberger et al., 2009)
  • Proposition A.1
  • proof