
Private federated discovery of out-of-vocabulary words for Gboard

Ziteng Sun, Peter Kairouz, Haicheng Sun, Adria Gascon, Ananda Theertha Suresh

TL;DR

This work tackles private discovery of frequently typed out-of-vocabulary words for Gboard in a federated setting, coupling local differential privacy with secure aggregation to achieve central DP guarantees. It introduces a trie-based heavy hitter discovery algorithm that operates in multiple rounds with per-layer LDP and uses a pair of local samplers to cap user contributions, effectively transforming open-domain learning into a sequence of closed-domain steps. In production-like settings for en-US, it achieves an item-level central DP guarantee of $\varepsilon_{central}=0.315$ with $\delta=10^{-10}$ and demonstrates 16.8% coverage of OOV words, with noticeable gains from RandomSampling and multi-pass learning. The results illustrate a practical path to privacy-preserving vocabulary expansion on mobile keyboards and highlight tradeoffs between privacy budgets, sampler choice, and iteration count, informing deployment strategies and future enhancements such as user-level DP and TEEs for verifiable privacy claims.
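
The layer-by-layer idea described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's algorithm: k-ary randomized response stands in for the LDP randomizer, a uniform choice among a user's still-valid words stands in for the local sampler, each simulated user reports in exactly one layer, and the names `k_rr`, `discover_heavy_hitters`, `END`, and `threshold` are hypothetical.

```python
import math
import random
from collections import Counter

END = "$"  # hypothetical end-of-word marker


def k_rr(value, domain, eps, rng):
    """k-ary randomized response over a closed domain (stand-in randomizer).

    Reports the true value with probability e^eps / (e^eps + k - 1),
    otherwise a uniformly random *other* element. A user with nothing to
    report (value is None) sends a uniformly random element.
    """
    k = len(domain)
    if value is None or k == 1:
        return rng.choice(domain)
    p_true = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p_true:
        return value
    return rng.choice([d for d in domain if d != value])


def discover_heavy_hitters(users, alphabet, max_depth, eps, threshold, seed=0):
    """Grow a prefix trie one layer at a time from randomized reports.

    users: list of word lists, one list per simulated device.
    Assumes eps > 0. Each user reports in a single layer (round-robin).
    """
    rng = random.Random(seed)
    prefixes, found = {""}, set()
    for depth in range(1, max_depth + 1):
        # Closed domain for this layer: surviving prefixes extended by one symbol.
        domain = sorted(p + c for p in prefixes for c in alphabet + END)
        if not domain:
            break
        domain_set = set(domain)
        k = len(domain)
        p = math.exp(eps) / (math.exp(eps) + k - 1)
        q = 1.0 / (math.exp(eps) + k - 1)
        reports, n = Counter(), 0
        for i, words in enumerate(users):
            if i % max_depth != depth - 1:
                continue  # this user is assigned to another layer
            valid = [c for c in ((w + END)[:depth] for w in words) if c in domain_set]
            value = rng.choice(valid) if valid else None  # one item per user
            reports[k_rr(value, domain, eps, rng)] += 1
            n += 1
        prefixes = set()
        for v in domain:
            # Debiased count estimate (approximate when some users had no valid item).
            estimate = (reports[v] - n * q) / (p - q)
            if estimate >= threshold:
                (found if v.endswith(END) else prefixes).add(v)
    return {w[:-1] for w in found}
```

For example, `discover_heavy_hitters([["cab"]] * 3000 + [["abc"]] * 30, "abc", max_depth=4, eps=8.0, threshold=200)` recovers the frequent word while the rare one is pruned at the first layer. The large `eps` here is chosen only to make the toy run stable; it says nothing about the privacy parameters used in production.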

Abstract

The vocabulary of language models in Gboard, Google's keyboard application, plays a crucial role in improving user experience. One way to improve the vocabulary is to discover frequently typed out-of-vocabulary (OOV) words on user devices. This task requires strong privacy protection due to the sensitive nature of user input data. In this report, we present a private OOV discovery algorithm for Gboard, which builds on recent advances in private federated analytics. The system offers local differential privacy (LDP) guarantees for user-contributed words. With anonymous aggregation, the final released result satisfies central differential privacy guarantees with $\varepsilon = 0.315$ and $\delta = 10^{-10}$ for OOV discovery in en-US (English in the United States).

Paper Structure (17 sections, 3 theorems, 7 equations, 1 figure, 1 table, 2 algorithms)


Key Result

Theorem 1

With either GreedySampling or RandomSampling as the local sampler, the LDP trie-based heavy hitter algorithm (label alg:ldp_triehh) is $\varepsilon$-LDP at the item level.

Figures (1)

  • Figure 1: The estimated coverage of the recovered OOVs vs local privacy parameter $\varepsilon$.

Theorems & Definitions (6)

  • Definition 1: Central Differential Privacy (DP) (Dwork et al., 2006)
  • Definition 2: Local differential privacy (LDP) (Kasiviswanathan et al., 2008)
  • Theorem 1
  • Theorem 2
  • Definition 3: Approximate $k$-anonymity
  • Lemma 1
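
For reference, the two privacy notions named in Definitions 1 and 2 are usually stated as follows. This is the standard textbook formulation; the paper's own notation and its neighboring relation (for example, item-level versus user-level) may differ, and the symbols $M$, $R$, $D$, $D'$, $x$, $x'$, and $S$ are introduced here only for illustration.

```latex
% Central (\varepsilon, \delta)-DP: for all neighboring datasets D, D'
% and all measurable output sets S of a randomized mechanism M,
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S] + \delta .

% \varepsilon-LDP: for all pairs of inputs x, x' and all measurable
% output sets S of a local randomizer R,
\Pr[R(x) \in S] \le e^{\varepsilon} \, \Pr[R(x') \in S] .
```

The local definition constrains each device's report on its own, while the central definition constrains only the final released output, which is why the central parameter can be much smaller than the local one once reports are aggregated anonymously.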