Mayfly: Private Aggregate Insights from Ephemeral Streams of On-Device User Data
Christopher Bian, Albert Cheu, Stanislav Chiknavaryan, Zoe Gong, Marco Gruteser, Oliver Guinan, Yannis Guzman, Peter Kairouz, Artem Lagzdin, Ryan McKenna, Grace Ni, Edo Roth, Maya Spivak, Timon Van Overveldt, Ren Yi
TL;DR
Mayfly enables private, scalable analytics over ephemeral on-device streams by combining on-device data minimization, immediate in-memory cross-device aggregation, and streaming differential privacy (DP). It supports programmable SQL-like queries over time-windowed data and introduces a novel DP mechanism tailored to high-dimensional, location-based Group-By-Sum workloads, achieving $\varepsilon = 2$ per device per week in a production deployment across hundreds of millions of devices. The system emphasizes ephemerality and data minimization while preserving utility through a latency-tolerant anonymous release and hierarchical aggregation. Results on a sustainability use case demonstrate strong device reach and favorable utility, and a detailed evaluation of DP mechanisms identifies Activity+Metric Scaling as a key contributor to utility under tight privacy budgets. The work highlights practical tradeoffs between privacy, latency, and device resource usage, and points to future opportunities in confidential federated computations and in domains beyond location data.
Abstract
This paper introduces Mayfly, a federated analytics approach enabling aggregate queries over ephemeral on-device data streams without central persistence of sensitive user data. Mayfly minimizes data via on-device windowing and contribution bounding through SQL-programmability, anonymizes user data via streaming differential privacy (DP), and mandates immediate in-memory cross-device aggregation on the server -- ensuring only privatized aggregates are revealed to data analysts. Deployed for a sustainability use case estimating transportation carbon emissions from private location data, Mayfly computed over 4 million statistics across more than 500 million devices with a per-device, per-week DP $\varepsilon = 2$ while meeting strict data utility requirements. To achieve this, we designed a new DP mechanism for Group-By-Sum workloads leveraging statistical properties of location data, with potential applicability to other domains.
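
To make the pipeline described above concrete, the following is a minimal sketch of a contribution-bounded, differentially private Group-By-Sum. It is not Mayfly's mechanism: it uses the textbook Laplace mechanism rather than the paper's new mechanism, and every name and parameter here (`bound_contribution`, `private_group_by_sum`, `max_groups`, `max_value`, the public `domain` of group keys) is a hypothetical chosen for illustration. It assumes each device's rows are already restricted to one time window and that the set of group keys is public.

```python
import random
from collections import defaultdict


def bound_contribution(rows, max_groups, max_value):
    """Limit one device to max_groups groups and clip each group total to [0, max_value]."""
    totals = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    # Deterministic choice of which groups to keep (illustrative only).
    kept = sorted(totals.items())[:max_groups]
    return {key: min(max(total, 0.0), max_value) for key, total in kept}


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_group_by_sum(devices, domain, max_groups, max_value, epsilon, rng):
    """Aggregate bounded per-device contributions in memory, then add noise.

    Noise is added to every key in the public domain, so the released key
    set does not depend on the data.
    """
    allowed = set(domain)
    sums = {key: 0.0 for key in domain}
    for rows in devices:
        in_domain = [(k, v) for k, v in rows if k in allowed]
        for key, value in bound_contribution(in_domain, max_groups, max_value).items():
            sums[key] += value
    # Adding or removing one device changes the sums by at most
    # max_groups * max_value in L1 norm.
    scale = max_groups * max_value / epsilon
    return {key: total + laplace_noise(scale, rng) for key, total in sums.items()}


if __name__ == "__main__":
    devices = [
        [("cell_1", 3.0), ("cell_2", 40.0)],
        [("cell_1", 2.5), ("cell_3", 1.0), ("cell_4", 6.0)],
    ]
    released = private_group_by_sum(
        devices, ["cell_1", "cell_2", "cell_3", "cell_4"],
        max_groups=2, max_value=10.0, epsilon=2.0, rng=random.Random(0),
    )
    for key, value in sorted(released.items()):
        print(key, round(value, 2))
```

In this sketch, tighter bounds (`max_groups`, `max_value`) reduce the noise scale but discard more of each device's data, which is the kind of privacy-utility tradeoff the abstract refers to.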
