Causal Multi-Label Feature Selection in Federated Setting

Yukun Song, Dayuan Cao, Jiali Miao, Shuai Yang, Kui Yu

TL;DR

FedCMFS tackles causal multi-label feature selection under data privacy constraints by introducing a horizontal federated framework with three subroutines: FedCFL learns local causal parents and children for each label, FedCFR retrieves potentially missed causal features, and FedCFC corrects false positives using DAG symmetry. The approach aggregates local CI results with client-weighted averages, enabling global PC(Y) construction without sharing raw data. Empirical results across eight real datasets and six metrics show FedCMFS achieving the best average ranking, performing especially well on high-dimensional data, and benefiting from GPU-accelerated CI tests to reduce runtime. The work advances privacy-preserving causal feature selection in federated, multi-label scenarios and suggests future work for improving performance in small-sample regimes.
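The "DAG symmetry" idea behind false-positive correction can be illustrated with a minimal sketch. It assumes the common reading from the PC-learning literature: in a DAG, X is a parent or child of T exactly when T is a parent or child of X, so a candidate that fails this check is dropped. The function name `symmetry_correct` and the toy PC sets are hypothetical, and whether FedCFC applies precisely this rule is not stated in this summary.

```python
def symmetry_correct(pc):
    """Keep X in PC(T) only if T also appears in PC(X) (an assumed AND rule)."""
    corrected = {}
    for target, members in pc.items():
        corrected[target] = {
            x for x in members if target in pc.get(x, set())
        }
    return corrected


# Hypothetical candidate PC sets for one label (Y1) and two features.
pc_sets = {
    "Y1": {"X1", "X2"},
    "X1": {"Y1"},
    "X2": set(),  # X2 does not list Y1, so X2 is treated as a false positive
}

corrected = symmetry_correct(pc_sets)
print(corrected["Y1"])
```

In this toy input, X2 is removed from PC(Y1) because the relation does not hold in both directions, while X1 is kept.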

Abstract

Multi-label feature selection serves as an effective means for dealing with high-dimensional multi-label data. To achieve satisfactory performance, existing methods for multi-label feature selection often require the centralization of substantial data from multiple sources. However, in a federated setting, centralizing data from all sources and merging them into a single dataset is not feasible. To tackle this issue, in this paper, we study the challenging problem of causal multi-label feature selection in a federated setting and propose a Federated Causal Multi-label Feature Selection (FedCMFS) algorithm with three novel subroutines. Specifically, FedCMFS first uses the FedCFL subroutine, which considers label-label, label-feature, and feature-feature correlations, to learn the relevant features (candidate parents and children) of each class label while preserving data privacy without centralizing data. Second, FedCMFS employs the FedCFR subroutine to selectively recover the missed true relevant features. Finally, FedCMFS utilizes the FedCFC subroutine to remove false relevant features. Extensive experiments on 8 datasets show that FedCMFS is effective for causal multi-label feature selection in a federated setting.

Paper Structure (19 sections, 1 theorem, 10 equations, 6 figures, 11 tables, 4 algorithms)

Key Result

Theorem 3.1

(Pearl, 2009) In a DAG, given the Markov blanket (MB) of a variable $V_i$, denoted $MB(V_i)$, every $V_j \in V \setminus (MB(V_i) \cup \{V_i\})$ is conditionally independent of $V_i$ given $MB(V_i)$.
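The theorem can be checked on a small graph. The sketch below builds a toy DAG whose node names follow the Figure 1 caption (PC of A is {B, C, D}, MB of A is {B, C, D, E}); the specific edges and the extra nodes F and G are assumptions made for illustration, not taken from the paper. The helpers `markov_blanket` and `d_separated` are hypothetical names; d-separation is tested with the standard moralized ancestral graph construction.

```python
from itertools import combinations


def parents(dag, v):
    """Return the parents of v; dag maps each node to its list of children."""
    return {u for u, children in dag.items() if v in children}


def markov_blanket(dag, v):
    """Return (PC, MB) of v: MB = parents, children, and the children's other parents."""
    pc = parents(dag, v) | set(dag[v])
    spouses = set()
    for child in dag[v]:
        spouses |= parents(dag, child)
    return pc, (pc | spouses) - {v}


def ancestral_set(dag, nodes):
    """Return the given nodes together with all of their ancestors."""
    found, stack = set(nodes), list(nodes)
    while stack:
        for p in parents(dag, stack.pop()):
            if p not in found:
                found.add(p)
                stack.append(p)
    return found


def d_separated(dag, x, y, z):
    """True if x and y are d-separated given z (moralized ancestral graph test)."""
    keep = ancestral_set(dag, {x, y} | set(z))
    adj = {n: set() for n in keep}
    for n in keep:
        pa = parents(dag, n)
        for p in pa:  # drop edge directions
            adj[p].add(n)
            adj[n].add(p)
        for a, b in combinations(pa, 2):  # "marry" co-parents
            adj[a].add(b)
            adj[b].add(a)
    seen, stack = {x}, [x]
    while stack:  # search that never enters the conditioning set
        for m in adj[stack.pop()]:
            if m not in seen and m not in z:
                seen.add(m)
                stack.append(m)
    return y not in seen


# Hypothetical edges: F -> B -> A, A -> C, A -> D, E -> D, D -> G.
dag = {"F": ["B"], "B": ["A"], "A": ["C", "D"], "E": ["D"],
       "C": [], "D": ["G"], "G": []}

pc_a, mb_a = markov_blanket(dag, "A")
outside = set(dag) - mb_a - {"A"}
print(sorted(pc_a), sorted(mb_a), sorted(outside))
print(all(d_separated(dag, "A", v, mb_a) for v in outside))
```

Under these assumed edges, E enters MB(A) only as a co-parent of D, and both nodes outside the blanket (F and G) are d-separated from A once MB(A) is conditioned on, whereas A and F are d-connected when nothing is conditioned on.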

Figures (6)

  • Figure 1: The MB of node A consisting of B,C,D,E. The PC set of node A comprising B, C, and D.
  • Figure 2: Phase I of FedCFL. Green and yellow denote labels and features, respectively. The figure shows the computation, transmission, and aggregation for a dataset containing three labels and four features. Uncolored squares and squares in other colors indicate that each client locally and independently computes the correlation value and P value between each label ($Y_i$) and every other node ($V_k\in V\setminus \{Y_i\}$).
  • Figure 3: Phase II of FedCFL. The server sends the triplet $\langle Y_i,V_k,CS\rangle$ to the clients to test conditional independence; each client computes and returns its P value $P_{n\langle Y_i,V_k,CS\rangle}$, and these are aggregated into the weighted P value $P_{\langle Y_i,V_k,CS\rangle}$.
  • Figure 4: The label-label correlations lead to missing true causal features.
  • Figure 5: Parameter sensitivity analysis of FedCMFS.
  • ...and 1 more figure
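The Phase II exchange described in the Figure 3 caption can be sketched as a single round: a triplet goes out, each client returns a local P value, and the values are combined into one weighted P value. This is only a sketch under stated assumptions: the weights are taken to be proportional to each client's sample count, the threshold `alpha = 0.05` is illustrative, and the `Client` class, `aggregate_p_value`, and the constant local P values are hypothetical stand-ins for a real local conditional independence test.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

# (label Y_i, candidate node V_k, conditioning set CS)
Triplet = Tuple[str, str, FrozenSet[str]]


@dataclass
class Client:
    n_samples: int
    # Placeholder for a local CI test; only the P value leaves the client.
    local_ci_test: Callable[[Triplet], float]


def aggregate_p_value(clients: List[Client], triplet: Triplet) -> float:
    """Sample-size-weighted average of local P values (assumed weighting)."""
    total = sum(c.n_samples for c in clients)
    return sum(c.n_samples * c.local_ci_test(triplet) for c in clients) / total


def is_independent(clients: List[Client], triplet: Triplet,
                   alpha: float = 0.05) -> bool:
    """Declare conditional independence when the aggregated P value exceeds alpha."""
    return aggregate_p_value(clients, triplet) > alpha


triplet = ("Y1", "X3", frozenset({"X1"}))
clients = [
    Client(100, lambda t: 0.20),  # small client, weak local evidence
    Client(300, lambda t: 0.02),
    Client(600, lambda t: 0.01),  # large client, strong local evidence
]

p = aggregate_p_value(clients, triplet)
print(round(p, 3), is_independent(clients, triplet))
```

With these placeholder values the weighted P value is 0.032, so the pair is judged dependent at the illustrative threshold, whereas an unweighted mean of the three local values (about 0.077) would give the opposite decision. No raw samples are exchanged in the round.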

Theorems & Definitions (4)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Theorem 3.1