Federated Learning Of Out-Of-Vocabulary Words

Mingqing Chen, Rajiv Mathews, Tom Ouyang, Françoise Beaufays

TL;DR

The paper addresses expanding mobile keyboard vocabularies by learning out-of-vocabulary (OOV) words directly on user devices with federated learning (FL), so that raw text never leaves the device. It introduces a character-level LSTM with CIFG (coupled input and forget gates), peephole connections, and a projection layer, trained with a cross-entropy loss, and uses Monte Carlo sampling from the trained model to generate OOV candidates. The approach is evaluated in a simulated FL setting on Reddit data and in real on-device FL across multiple languages. On-device, it reports character-level prediction metrics such as an en_US top-3 accuracy of around 55.8% and a cross entropy of approximately 2.35; in the simulation, sampled OOV words reach high precision and recall (e.g., 90.56%/81.22% for the top $10^5$ words). The results demonstrate practical feasibility for privacy-preserving vocabulary expansion in mobile keyboards, while acknowledging privacy considerations such as potential memorization and the need for future filtering and privacy enhancements.

Abstract

We demonstrate that a character-level recurrent neural network is able to learn out-of-vocabulary (OOV) words under federated learning settings, for the purpose of expanding the vocabulary of a virtual keyboard for smartphones without exporting sensitive text to servers. High-frequency words can be sampled from the trained generative model by drawing from the joint posterior directly. We study the feasibility of the approach in two settings: (1) using simulated federated learning on a publicly available non-IID per-user dataset from a popular social networking website, (2) using federated learning on data hosted on user mobile devices. The model achieves good recall and precision compared to ground-truth OOV words in setting (1). With (2) we demonstrate the practicality of this approach by showing that we can learn meaningful OOV words with good character-level prediction accuracy and cross entropy loss.
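
The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the trained LSTM is replaced here by a hypothetical toy table `TOY_MODEL` that conditions only on the previous character, and the names `sample_word`, `sample_vocabulary`, and `precision_recall`, as well as the `^` and `$` boundary markers, are assumptions made for illustration. The idea it shows is that repeatedly drawing characters one at a time until an end-of-word symbol yields whole words, and that counting the draws surfaces the high-probability words, which can then be scored against a ground-truth OOV set.

```python
import random
from collections import Counter

START, EOW = "^", "$"  # hypothetical start and end-of-word markers

# Toy stand-in for the trained character-level model (an assumption):
# next-character probabilities given only the previous character.
# A real LSTM would condition on the whole prefix.
TOY_MODEL = {
    START: {"l": 0.5, "y": 0.3, "h": 0.2},
    "l": {"o": 0.6, EOW: 0.4},
    "o": {"l": 0.5, EOW: 0.5},
    "y": {"o": 1.0},
    "h": {"i": 1.0},
    "i": {EOW: 1.0},
}


def sample_word(rng, max_len=20):
    """Draw one word character by character until EOW or max_len."""
    prev, chars = START, []
    while len(chars) < max_len:
        dist = TOY_MODEL[prev]
        nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        if nxt == EOW:
            break
        chars.append(nxt)
        prev = nxt
    return "".join(chars)


def sample_vocabulary(num_samples, seed=0):
    """Monte Carlo estimate of word frequencies under the model."""
    rng = random.Random(seed)
    return Counter(sample_word(rng) for _ in range(num_samples))


def precision_recall(sampled_words, ground_truth):
    """Precision and recall of sampled words against a ground-truth set."""
    sampled = set(sampled_words)
    hits = len(sampled & set(ground_truth))
    precision = hits / len(sampled) if sampled else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


if __name__ == "__main__":
    counts = sample_vocabulary(2000)
    top_k = [word for word, _ in counts.most_common(3)]
    print(top_k, precision_recall(top_k, {"lo", "yo", "hi", "sup"}))
```

In this sketch the top-$K$ cut-off plays the role of the ranking used when precision and recall are reported for the most frequently sampled words; the ground-truth set passed to `precision_recall` is invented for the example.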

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Monte Carlo sampling of OOV words from the LSTM model.
  • Figure 2: Precision vs. top-$K$ uniquely sampled words in simulated FL experiments.
  • Figure 4: Cross entropy loss on live client evaluation data for three different FL settings for en_US.
  • Figure 5: Top-3 Character-level prediction accuracy on live client evaluation data for three different FL settings for en_US.