Global Outlier Detection in a Federated Learning Setting with Isolation Forest

Daniele Malpetti, Laura Azzimonti

TL;DR

The paper tackles privacy-preserving global outlier detection in cross-silo federated learning (FL) with a two-server architecture that operates on masked data. Clients jointly generate a masking transformation $M=Q S Q'$ and additive noise $R^i$, then share masked representations so that a central detector can run Isolation Forest (IF) or Extended Isolation Forest (EIF) on $X_{\text{masked}}$ without learning which client owns which data point, achieving results comparable to centralized IF on plain data. Key contributions include a secure protocol for seed agreement via Paillier encryption, a structured data masking and transfer scheme, and an analysis of privacy implications, including collusion scenarios and potential privacy enhancements. The approach is practical as a preprocessing step in FL pipelines and suggests that similar masking strategies could be applied to other anomaly detection tasks while preserving data confidentiality.
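The summary does not spell out the seed agreement protocol, so the following is only a minimal sketch of the property that makes Paillier suitable for such a task: ciphertexts can be combined so that the result decrypts to the sum of the plaintexts. The toy primes, the function names (`keygen`, `encrypt`, `decrypt`), and the idea of deriving a seed from summed client shares are all assumptions made for illustration, not the paper's protocol.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier key generation with generator g = n + 1 (insecure key size)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return n, (lam, mu)

def encrypt(n, m, rng):
    """Encrypt m as (1 + n)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover m as L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) // n."""
    lam, mu = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n * mu) % n

rng = random.Random(42)
n, priv = keygen(1009, 1013)  # small primes, for illustration only

# Hypothetical use: each client contributes a secret random share.
shares = [rng.randrange(0, 1000) for _ in range(3)]
ciphertexts = [encrypt(n, s, rng) for s in shares]

# Multiplying ciphertexts mod n^2 adds the underlying plaintexts mod n.
combined = 1
for c in ciphertexts:
    combined = (combined * c) % (n * n)

seed = decrypt(n, priv, combined)
print(seed == sum(shares) % n)
```

In this sketch the party that multiplies the ciphertexts never sees an individual share, and the party holding the private key only sees the sum. Which roles the two servers and the clients actually play is defined in the paper, not here.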

Abstract

We present a novel strategy for detecting global outliers in a federated learning setting, targeting in particular cross-silo scenarios. Our approach involves the use of two servers and the transmission of masked local data from clients to one of the servers. The masking of the data prevents the disclosure of sensitive information while still permitting the identification of outliers. Moreover, to further safeguard privacy, a permutation mechanism is implemented so that the server does not know which client owns any masked data point. The server performs outlier detection on the masked data, using either Isolation Forest or its extended version, and then communicates outlier information back to the clients, allowing them to identify and remove outliers in their local datasets before starting any subsequent federated model training. This approach provides comparable results to a centralized execution of Isolation Forest algorithms on plain data.

Paper Structure (19 sections, 2 figures, 1 table, 3 algorithms)

Figures (2)

  • Figure 1: Pictorial representation of the steps conducted by IF and EIF to isolate an inlier and an outlier in a two-dimensional dataset.
  • Figure 2: Performance for outlier classification in different datasets, for both IF and EIF. For each case, a reference approach and four multiparty approaches (corresponding to four different values of the parameter $T$) are shown. Each box in the boxplots includes 100 runs.
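
The isolation idea behind Figure 1 can be illustrated with a minimal sketch: a point far from the bulk of the data is separated from all other points after fewer random splits than a point in the dense region. The code below follows a single point through random axis-parallel splits, as in IF. It is not a full Isolation Forest (no subsampling, no trees, no anomaly score), and the data, the function names, and the constants are assumptions chosen for illustration.

```python
import random

def isolation_depth(points, target, rng, max_depth=50):
    """Count random axis-parallel splits until `target` is alone in its partition."""
    current = list(points)
    depth = 0
    n_dims = len(target)
    while len(current) > 1 and depth < max_depth:
        # Only dimensions with a non-degenerate range can be split.
        splittable = [d for d in range(n_dims)
                      if min(p[d] for p in current) < max(p[d] for p in current)]
        if not splittable:
            break
        dim = rng.choice(splittable)
        lo = min(p[dim] for p in current)
        hi = max(p[dim] for p in current)
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target[dim] < split:
            current = [p for p in current if p[dim] < split]
        else:
            current = [p for p in current if p[dim] >= split]
        depth += 1
    return depth

def mean_depth(points, target, rng, trials=200):
    """Average isolation depth of `target` over repeated random partitionings."""
    return sum(isolation_depth(points, target, rng) for _ in range(trials)) / trials

rng = random.Random(0)
inlier = (0.0, 0.0)   # centre of the cluster
outlier = (8.0, 8.0)  # far from the cluster
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)] + [inlier, outlier]

inlier_depth = mean_depth(data, inlier, rng)
outlier_depth = mean_depth(data, outlier, rng)
print(outlier_depth < inlier_depth)
```

EIF differs in that it splits along randomly oriented hyperplanes rather than along coordinate axes. The sketch covers only the axis-parallel case.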