Algorithmic Data Minimization for Machine Learning over Internet-of-Things Data Streams

Ted Shaowang, Shinan Liu, Jonatas Marques, Nick Feamster, Sanjay Krishnan

TL;DR

The work addresses privacy risks in IoT data by formalizing data minimization as a two-player Re-identification Game between a provider and an adversary. It introduces a practical API and a two-stage optimization (greedy preselection followed by exhaustive search) that minimizes identifiability while preserving task accuracy, validated across seven IoT datasets. Key contributions include a formal model, a suite of scalable solution strategies, and best-practice recommendations that favor dense feature representations and careful feature preselection. The reported results show user identifiability reduced by up to 16.7% with accuracy loss below 1%.
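The two-stage optimization named above (greedy preselection followed by exhaustive search) can be illustrated with a minimal sketch. This is not the authors' API or implementation: the function names, the toy feature names, the additive toy scorers, and the choice of task accuracy as the greedy criterion are all assumptions made for illustration. The paper's own preselection may rank features differently (Figure 1 mentions a SHAP-based greedy strategy, for example).

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Optional, Sequence, Tuple

# Hypothetical scorer type: maps a feature subset to a score in [0, 1].
Scorer = Callable[[FrozenSet[str]], float]


def greedy_preselect(features: Sequence[str], task_accuracy: Scorer, k: int) -> List[str]:
    """Stage 1 (assumed criterion): grow a pool of at most k features, each
    step adding the feature that yields the highest task accuracy."""
    pool: List[str] = []
    remaining = list(features)
    while remaining and len(pool) < k:
        best = max(remaining, key=lambda f: task_accuracy(frozenset(pool + [f])))
        pool.append(best)
        remaining.remove(best)
    return pool


def exhaustive_search(
    pool: Sequence[str], task_accuracy: Scorer, identifiability: Scorer, min_accuracy: float
) -> Tuple[Optional[FrozenSet[str]], float]:
    """Stage 2: try every non-empty subset of the pool and keep the one with
    the lowest identifiability among those meeting the accuracy floor."""
    best_subset: Optional[FrozenSet[str]] = None
    best_ident = float("inf")
    for size in range(1, len(pool) + 1):
        for combo in combinations(pool, size):
            subset = frozenset(combo)
            if task_accuracy(subset) < min_accuracy:
                continue  # violates the accuracy constraint
            ident = identifiability(subset)
            if ident < best_ident:
                best_subset, best_ident = subset, ident
    return best_subset, best_ident


def minimize_features(
    features: Sequence[str], task_accuracy: Scorer, identifiability: Scorer,
    k: int, max_accuracy_loss: float,
) -> Tuple[Optional[FrozenSet[str]], float]:
    """Run both stages; the accuracy floor is relative to using all features."""
    baseline = task_accuracy(frozenset(features))
    pool = greedy_preselect(features, task_accuracy, k)
    return exhaustive_search(pool, task_accuracy, identifiability, baseline - max_accuracy_loss)


# Toy, made-up per-feature contributions (not taken from the paper).
TASK_GAIN = {"accel": 0.60, "gyro": 0.33, "temp": 0.05, "wifi_rssi": 0.005, "mac_addr": 0.003}
IDENT_RISK = {"accel": 0.10, "gyro": 0.05, "temp": 0.01, "wifi_rssi": 0.20, "mac_addr": 0.70}


def toy_accuracy(subset: FrozenSet[str]) -> float:
    # Additive stand-in for training and evaluating a task model.
    return sum(TASK_GAIN[f] for f in subset)


def toy_identifiability(subset: FrozenSet[str]) -> float:
    # Additive stand-in for an adversary's re-identification success.
    return sum(IDENT_RISK[f] for f in subset)


subset, ident = minimize_features(
    list(TASK_GAIN), toy_accuracy, toy_identifiability, k=4, max_accuracy_loss=0.01
)
print(sorted(subset), ident)
```

In this toy setup the search keeps the three motion and temperature features and drops the two network-derived ones, since removing them costs less than the 0.01 accuracy budget. The preselection stage matters because the exhaustive stage visits every non-empty subset of the pool, which grows as 2^k.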

Abstract

Machine learning can analyze vast amounts of data generated by IoT devices to identify patterns, make predictions, and enable real-time decision-making. By processing sensor data, machine learning models can optimize processes, improve efficiency, and enhance personalized user experiences in smart systems. However, IoT systems are often deployed in sensitive environments such as households and offices, where they may inadvertently expose identifiable information, including location, habits, and personal identifiers. This raises significant privacy concerns, necessitating the application of data minimization -- a foundational principle in emerging data regulations, which mandates that service providers collect only data that is directly relevant and necessary for a specified purpose. Despite its importance, data minimization lacks a precise technical definition in the context of sensor data, where collections of weak signals make it challenging to apply a binary "relevant and necessary" rule. This paper provides a technical interpretation of data minimization in the context of sensor streams, explores practical methods for implementation, and addresses the challenges involved. We demonstrate that our framework can reduce user identifiability by up to 16.7% while keeping accuracy loss below 1%, offering a viable path toward privacy-preserving IoT data processing.

Paper Structure

This paper contains 29 sections, 5 equations, 2 figures, and 11 tables.

Figures (2)

  • Figure 1: Example of the accuracy-identifiability tradeoff on the CIC-IDS 2017 dataset. Even though feature hashing (left) gets a higher relative effectiveness score, a user might prefer the tradeoff of SHAP-based greedy (right) when judging by cost-to-value ratio.
  • Figure 2: Case study: Opportunity dataset.

Theorems & Definitions (2)

  • Definition 1: Data Minimization
  • Definition 2: Relative Effectiveness