Mayfly: Private Aggregate Insights from Ephemeral Streams of On-Device User Data

Christopher Bian, Albert Cheu, Stanislav Chiknavaryan, Zoe Gong, Marco Gruteser, Oliver Guinan, Yannis Guzman, Peter Kairouz, Artem Lagzdin, Ryan McKenna, Grace Ni, Edo Roth, Maya Spivak, Timon Van Overveldt, Ren Yi

TL;DR

Mayfly solves private, scalable analytics over ephemeral on-device streams by combining on-device data minimization, immediate in-memory cross-device aggregation, and streaming differential privacy. It supports programmable SQL-like queries over time-windowed data, with a novel DP mechanism tailored to high-dimensional location-based Group-By-Sum workloads, achieving $\varepsilon = 2$ per device per week in a production deployment across hundreds of millions of devices. The system emphasizes ephemerality and data minimization while delivering utility through a latency-tolerant anonymous release and hierarchical aggregation. The results on a sustainability use case demonstrate strong device reach and favorable utility, with a detailed evaluation of DP mechanisms showing Activity+Metric Scaling as a key contributor to utility under tight privacy budgets. The work highlights practical tradeoffs between privacy, latency, and device resource usage, and points to future opportunities in confidential federated computations and broader domain applicability beyond location data.

Abstract

This paper introduces Mayfly, a federated analytics approach enabling aggregate queries over ephemeral on-device data streams without central persistence of sensitive user data. Mayfly minimizes data via on-device windowing and contribution bounding through SQL-programmability, anonymizes user data via streaming differential privacy (DP), and mandates immediate in-memory cross-device aggregation on the server -- ensuring only privatized aggregates are revealed to data analysts. Deployed for a sustainability use case estimating transportation carbon emissions from private location data, Mayfly computed over 4 million statistics across more than 500 million devices with a per-device, per-week DP $\varepsilon = 2$ while meeting strict data utility requirements. To achieve this, we designed a new DP mechanism for Group-By-Sum workloads leveraging statistical properties of location data, with potential applicability to other domains.
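
The pipeline outlined in the abstract (bounded per-device contributions, cross-device summation, central noise before release) can be illustrated with a minimal sketch. This is not the paper's Activity + Metric Scaling mechanism; it is a generic Laplace mechanism for a Group-By-Sum query, assuming a fixed, public list of groups and L1 clipping of each device's contribution. All names (`clip_l1`, `private_group_by_sum`, `l1_bound`, `threshold`) are hypothetical and chosen for illustration only.

```python
import random


def clip_l1(contribution, l1_bound):
    """Rescale a device's {group: value} map so its L1 norm is at most l1_bound."""
    norm = sum(abs(v) for v in contribution.values())
    if norm <= l1_bound:
        return dict(contribution)
    factor = l1_bound / norm
    return {g: v * factor for g, v in contribution.items()}


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_group_by_sum(device_contributions, groups, l1_bound, epsilon,
                         threshold, rng=None):
    """Sum clipped per-device contributions, add Laplace noise, then threshold.

    Assumption: `groups` is a fixed public list, so every group receives noise
    regardless of whether any device contributed to it.
    """
    rng = rng or random.Random()
    totals = {g: 0.0 for g in groups}
    for contribution in device_contributions:
        # Drop groups outside the public list, then bound the device's influence.
        known = {g: v for g, v in contribution.items() if g in totals}
        for g, v in clip_l1(known, l1_bound).items():
            totals[g] += v
    # Adding or removing one device changes the sums by at most l1_bound in L1.
    scale = l1_bound / epsilon
    noisy = {g: t + laplace_noise(scale, rng) for g, t in totals.items()}
    # Post-processing: suppress groups whose noisy sum falls below the threshold.
    return {g: v for g, v in noisy.items() if v >= threshold}
```

In this sketch, clipping plays the role of contribution bounding, and thresholding is pure post-processing of the noisy sums, so it does not consume additional privacy budget under the stated assumption of a public group list.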

Paper Structure

This paper contains 28 sections, 2 equations, 8 figures, 2 tables, 4 algorithms.

Figures (8)

  • Figure 1: Overview of all system components and actors. Analyst queries (A) are converted into lightweight SQLite-compatible subqueries for execution on device (B). This includes some DP parameters, and allows devices to locally bound their contributions, and upload focused updates (C). The federated server performs an immediate aggregation (D), resulting in server-side aggregates that have a strict time-to-live. Anonymization (E) adds noise and prepares the data for internal and external release, with some optional post-processing to further minimize the data and give focused results for external consumers.
  • Figure 2: Server and device timelines for multiple time windows.
  • Figure 3: Task assignment architecture and immediate aggregation workflow.
  • Figure 4: Aggregation Service architecture.
  • Figure 5: A complete overview of the data collection and processing steps, including the "Activity + Metric Scaling Mechanism" (with two sample devices shown). Data starts on device, is scaled and clipped locally, before being aggregated hierarchically on the server. Once aggregated, we add Laplace noise centrally, and perform post-processing steps of descaling and thresholding, before releasing the private data to downstream emissions calculations and then to relevant stakeholders.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Example 2.1: Transportation Data
  • Example 3.1
  • Example 4.1: Product A Workload
  • Example 4.2: Activity + Metric Scaling