Private federated discovery of out-of-vocabulary words for Gboard

Ziteng Sun; Peter Kairouz; Haicheng Sun; Adria Gascon; Ananda Theertha Suresh

TL;DR

This work tackles private discovery of frequently typed out-of-vocabulary (OOV) words for Gboard in a federated setting, coupling local differential privacy (LDP) with secure aggregation to achieve central DP guarantees. It introduces a trie-based heavy hitter discovery algorithm that runs over multiple rounds with per-layer LDP and uses one of two local samplers (GreedySampling or RandomSampling) to cap each user's contribution, turning open-domain learning into a sequence of closed-domain steps. In production-like settings for en-US, it achieves item-level central DP with $\varepsilon_{central}=0.315$ and $\delta=10^{-10}$ and reaches 16.8% coverage of OOV words, with noticeable gains from RandomSampling and multi-pass learning. The results illustrate a practical path to privacy-preserving vocabulary expansion on mobile keyboards and highlight tradeoffs between privacy budget, sampler choice, and iteration count. These tradeoffs inform deployment strategies and future enhancements such as user-level DP and TEEs for verifiable privacy claims.
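The layer-by-layer idea can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's algorithm: it assumes k-ary randomized response as the closed-domain LDP randomizer, a uniformly random local sampler, and disjoint user groups per trie layer so that each user sends exactly one report. The names `discover`, `k_rr`, `debias`, `END`, and `NONE` are hypothetical.

```python
import math
import random
from collections import Counter

END = "$"        # hypothetical end-of-word marker
NONE = "<none>"  # hypothetical symbol: "my word is under no live prefix"


def k_rr(value, domain, eps, rng):
    """k-ary randomized response: an eps-LDP randomizer over a closed domain."""
    k = len(domain)
    p_true = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p_true:
        return value
    return rng.choice([d for d in domain if d != value])


def debias(counts, domain, n, eps):
    """Unbiased count estimates from n k-RR reports."""
    k = len(domain)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = 1.0 / (math.exp(eps) + k - 1)
    return {d: (counts.get(d, 0) - n * q) / (p - q) for d in domain}


def discover(user_words, alphabet, max_len, eps, threshold, seed=0):
    """Grow a trie one layer per round; `alphabet` is a string of characters."""
    rng = random.Random(seed)
    # Local sampling (assumed uniform): each user keeps a single word.
    sampled = [rng.choice(words) + END for words in user_words]
    rng.shuffle(sampled)
    # Assumption: one disjoint user group per layer, one report per user.
    groups = [sampled[i::max_len] for i in range(max_len)]
    live, found = [""], set()
    for depth, group in enumerate(groups, start=1):
        if not live:
            break
        # Closed domain for this round: one-character extensions of live prefixes.
        domain = [p + c for p in live for c in alphabet + END] + [NONE]
        valid = set(domain)
        reports = Counter()
        for w in group:
            in_domain = len(w) >= depth and w[:depth] in valid
            true_value = w[:depth] if in_domain else NONE
            reports[k_rr(true_value, domain, eps, rng)] += 1
        est = debias(reports, domain, len(group), eps)
        kept = [d for d in domain if d != NONE and est[d] >= threshold]
        found.update(d[:-1] for d in kept if d.endswith(END))
        live = [d for d in kept if not d.endswith(END)]
    return found
```

Each round only asks users to report within a known, finite set of candidate prefixes, which is what lets a closed-domain randomizer be applied to an open-domain problem. The threshold and the per-layer split are illustrative choices; a deployment would tune them against the privacy budget and population size.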

Abstract

The vocabulary of language models in Gboard, Google's keyboard application, plays a crucial role in improving user experience. One way to improve the vocabulary is to discover frequently typed out-of-vocabulary (OOV) words on user devices. This task requires strong privacy protection due to the sensitive nature of user input data. In this report, we present a private OOV discovery algorithm for Gboard, which builds on recent advances in private federated analytics. The system offers local differential privacy (LDP) guarantees for user-contributed words. With anonymous aggregation, the final released result would satisfy central differential privacy guarantees with $\varepsilon = 0.315$ and $\delta = 10^{-10}$ for OOV discovery in en-US (English in the United States).

Paper Structure (17 sections, 3 theorems, 7 equations, 1 figure, 1 table, 2 algorithms)

Introduction
Private federated heavy hitter discovery
Differential privacy.
Trie-based heavy hitters with local differential privacy
Local dataset sampler.
Multiple-pass algorithm.
Privacy implication of the trie-based heavy hitter algorithm (alg:ldp_triehh).
Aggregation and access control on the server.
Discussion on (approximate) $k$-anonymity.
Case study of OOV discovery in Gboard inputs
Privacy parameter.
Choice of the sampler and number of passes.
OOV discovery in production population (en-US).
Discussion
Acknowledgement
...and 2 more sections

Key Result

Theorem 1

With either GreedySampling or RandomSampling as the local sampler, the trie-based heavy hitter algorithm (alg:ldp_triehh) is $\varepsilon$-LDP at the item level.

Figures (1)

Figure 1: The estimated coverage of the recovered OOVs vs local privacy parameter $\varepsilon$.

Theorems & Definitions (6)

Definition 1: Central Differential Privacy (DP) [dwork2006calibrating]
Definition 2: Local differential privacy (LDP) [kasiviswanathan2008ldp]
Theorem 1
Theorem 2
Definition 3: Approximate $k$-anonymity
Lemma 1
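
For reference, the two privacy notions listed above are usually stated as follows. This is the standard textbook formulation, given here as a reading aid; the paper's own wording may differ, for example in how neighboring datasets are defined at the item level. The symbols $M$, $R$, $D$, $D'$, $S$, $x$, $x'$, and $y$ are notation assumed for this block.

```latex
% Central (\varepsilon, \delta)-DP: for all neighboring datasets D, D'
% and all measurable output sets S,
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S] + \delta .

% \varepsilon-LDP: for all inputs x, x' to the local randomizer R
% and all outputs y,
\Pr[R(x) = y] \le e^{\varepsilon} \, \Pr[R(x') = y] .
```

In the local model the guarantee holds for each report before it leaves the device, whereas the central guarantee applies to the released aggregate.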
