PAPAYA Federated Analytics Stack: Engineering Privacy, Scalability and Practicality
Harish Srinivas, Graham Cormode, Mehrdad Honarkhah, Samuel Lurye, Jonathan Hehir, Lunwen He, George Hong, Ahmed Magdy, Dzmitry Huba, Kaikai Wang, Shen Guo, Shoubhik Bhattacharya
TL;DR
The paper addresses the privacy, accuracy, and scalability challenges of cross-device analytics by introducing the Papaya Federated Analytics stack, a three-zone architecture that uses trusted execution environments (TEEs) and one-shot private aggregation to compute analytics across billions of devices with minimal data leakage. It presents an expressive, SQL-like model for on-device data transformation, combined with secure sum and thresholding (SST) inside TEEs, and supports central, local, and distributed differential privacy, including periodic data releases. A production-scale evaluation on nearly $10^8$ devices demonstrates feasible iteration speed, predictable query load, and accurate histogram and quantile results despite heterogeneous device conditions and network latency. The work shows that large-scale, privacy-preserving federated analytics is practical and can provide strong privacy guarantees with minimal utility loss, enabling real-world analytics while protecting user data.
Abstract
Cross-device Federated Analytics (FA) is a distributed computation paradigm designed to answer analytics queries about, and derive insights from, data held locally on users' devices. On-device computations combined with other privacy and security measures ensure that only minimal data is transmitted off-device, achieving a high standard of data protection. Despite FA's broad relevance, the applicability of existing FA systems is limited by compromised accuracy, a lack of flexibility for data analytics, and an inability to scale effectively. In this paper, we describe our approach to combining privacy, scalability, and practicality to build and deploy a system that overcomes these limitations. Our FA system leverages trusted execution environments (TEEs) and optimizes the use of on-device computing resources to facilitate federated data processing across large fleets of devices, while ensuring robust, defensible, and verifiable privacy safeguards. We focus on federated analytics (statistics and monitoring), in contrast to systems for federated learning (ML workloads), and we flag the key differences.
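As a rough illustration of the secure sum and thresholding (SST) idea named in the TL;DR, the sketch below sums per-device histogram contributions, optionally adds Gaussian noise in the style of central differential privacy, and releases only buckets whose noisy total reaches a threshold. It is a minimal sketch and not the Papaya implementation: the function name `secure_sum_and_threshold`, its parameters, and the report format are assumptions made for illustration, and the TEE that would host this step in the real system is not modeled.

```python
import random
from collections import Counter


def secure_sum_and_threshold(device_reports, threshold, noise_stddev=0.0, rng=None):
    """Hypothetical sketch of secure sum and thresholding (SST).

    device_reports: iterable of dicts mapping a histogram key to a count
                    (assumed report format, one dict per device).
    threshold:      minimum (noisy) total a bucket needs to be released.
    noise_stddev:   standard deviation of Gaussian noise added per bucket;
                    0.0 disables noise (assumption: central-DP-style noise).
    """
    rng = rng or random.Random()

    # Secure sum: only the aggregate over all devices is kept.
    totals = Counter()
    for report in device_reports:
        totals.update(report)

    released = {}
    for key, total in totals.items():
        noise = rng.gauss(0.0, noise_stddev) if noise_stddev > 0 else 0.0
        noisy_total = total + noise
        # Thresholding: suppress buckets supported by too few contributions.
        if noisy_total >= threshold:
            released[key] = noisy_total
    return released


# Three devices report which (hypothetical) app version they run.
reports = [{"v1": 1}, {"v1": 1, "v2": 1}, {"v1": 1}]
result = secure_sum_and_threshold(reports, threshold=2)
```

With noise disabled, only the `v1` bucket is released here, because `v2` is supported by a single device and falls below the threshold. In a deployment, the noise scale and threshold would be chosen to meet a target privacy guarantee, which this sketch does not attempt to calibrate.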
