MISFEAT: Feature Selection for Subgroups with Systematic Missing Data

Bar Genossar, Thinh On, Md. Mouinul Islam, Ben Eliav, Senjuti Basu Roy, Avigdor Gal

TL;DR

MISFEAT tackles subgroup-aware feature selection under systematic missing data by predicting the joint mutual information $I(F^m;Y)$ for $m$-sized feature subsets within each subgroup. It constructs a multiplex lattice graph with subgroup-specific feature lattices and trains a heterogeneous GNN (GraphSAGE) to predict MI scores, using RANDWALK-based sampling and entropy-based pre-computations to manage the combinatorial explosion. The model outputs top-$K$ feature subsets for every subgroup and leverages the MI upward closure property to improve predictions without exhaustive enumeration. Empirical results on real-world and synthetic data demonstrate that MISFEAT outperforms baselines, remains robust to high levels of systematic missingness, and scales to many subgroups and features, with the code made publicly available.
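The upward closure property is only named in the summary above. One plausible reading, assumed here rather than taken from the paper, is the standard monotonicity of mutual information under adding a feature, which follows from the chain rule:

$$I(F \cup \{f\};Y) = I(F;Y) + I(f;Y \mid F) \;\ge\; I(F;Y), \qquad \text{since } I(f;Y \mid F) \ge 0.$$

Under this reading, the score of any feature subset is a lower bound on the score of each of its supersets in the lattice, which is the kind of constraint a model can exploit without enumerating every subset.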

Abstract

We investigate the problem of selecting features for datasets that can be naturally partitioned into subgroups (e.g., according to socio-demographic groups and age), each with its own dominant set of features. Within this subgroup-oriented framework, we address the challenge of systematic missing data, a scenario in which some feature values are missing for all tuples of a subgroup, due to flawed data integration, regulatory constraints, or privacy concerns. Feature selection is governed by mutual information, a popular quantification of correlation, between features and a target variable. Our goal is to identify the top-K feature subsets of some fixed size with the highest joint mutual information with a target variable. In the presence of systematic missing data, the closed form of mutual information cannot simply be applied. We argue that in such a setting, leveraging relationships between the available mutual information values of features, within a subgroup or across subgroups, can assist in inferring the missing mutual information values. We propose a generalizable model based on a heterogeneous graph neural network that identifies interdependencies among feature-subgroup-target variable connections by modeling them as a multiplex graph and propagating information between its nodes. We address two distinct scalability challenges related to training and propose principled solutions to tackle them. Through an extensive empirical evaluation, we demonstrate the efficacy of the proposed solutions in terms of both quality and running time.
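To make the selection objective concrete, the following is a minimal sketch of exhaustive top-K selection by joint mutual information when all feature values are observed. It uses the standard identity $I(F;Y) = H(F) + H(Y) - H(F,Y)$ on empirical counts. The function names (`entropy`, `joint_mi`, `top_k_subsets`) and the toy data are illustrative assumptions, not the authors' implementation, and the sketch does not handle missing values.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows):
    """Shannon entropy (in bits) of the empirical distribution over `rows`."""
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def joint_mi(data, target, subset):
    """Empirical I(F;Y) = H(F) + H(Y) - H(F,Y) for the columns in `subset`."""
    f = [tuple(row[j] for j in subset) for row in data]
    fy = [fi + (y,) for fi, y in zip(f, target)]
    return entropy(f) + entropy(target) - entropy(fy)


def top_k_subsets(data, target, m, k):
    """Score every m-sized feature subset and return the k best (score, subset)."""
    n_features = len(data[0])
    scored = [
        (joint_mi(data, target, s), s)
        for s in combinations(range(n_features), m)
    ]
    # Highest score first; ties broken by subset indices for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k]


# Hypothetical toy subgroup: Y is the XOR of features 0 and 1,
# and feature 2 is constant (uninformative).
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
target = [0, 1, 1, 0]
top = top_k_subsets(data, target, m=2, k=1)
```

In this toy example only the pair of features 0 and 1 determines the target jointly, so it is the subset ranked first. When a feature is missing for every tuple of a subgroup, `joint_mi` cannot be evaluated for any subset containing it, which is the gap the abstract proposes to fill by prediction.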

Paper Structure

This paper contains 26 sections, 2 theorems, 15 equations, 9 figures, 4 tables, 2 algorithms.

Key Result

Lemma 1

The size of the subgroup feature lattice graph $\mathbb{G}_i=\left( \mathbb{V}_i, \mathbb{E}_i\right)$ is exponential in the number of features, both in terms of the number of nodes $|\mathbb{V}_i|$ and the number of edges $|\mathbb{E}_i|$.
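The growth stated in the lemma can be checked by brute force on small inputs. The sketch below assumes a lattice whose nodes are the non-empty feature subsets and whose edges link each subset to every superset with exactly one more feature; the paper's exact definition (Definition 3) is not reproduced in this summary, so this construction and the name `lattice_size` are assumptions. Under that assumption, $n$ features give $2^n - 1$ nodes and $n\,(2^{n-1} - 1)$ edges.

```python
from itertools import combinations


def lattice_size(n):
    """Count nodes and edges of the assumed subset lattice over n features.

    Nodes: all non-empty subsets of {0, ..., n-1}.
    Edges: (S, S + {f}) for every node S and every feature f not in S.
    """
    nodes = [
        frozenset(s)
        for r in range(1, n + 1)
        for s in combinations(range(n), r)
    ]
    # Each node has one outgoing edge per feature it does not yet contain.
    edges = sum(n - len(s) for s in nodes)
    return len(nodes), edges


# Compare brute-force counts with the closed forms for small n.
sizes = {n: lattice_size(n) for n in range(1, 9)}
```

For four features this gives 15 nodes and 28 edges, and both counts roughly double with each additional feature, which is why exhaustive enumeration per subgroup quickly becomes impractical.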

Figures (9)

  • Figure 1: An illustration of a single four-level lattice graph.
  • Figure 2: An illustration of a multiple lattice graph structure.
  • Figure 3: (Attrition dataset) nDCG and Precision with increasing $p$. MISFEAT is consistently more effective with smaller $\Delta$ values.
  • Figure 4: (Mobile dataset) nDCG and Precision with increasing $p$. MISFEAT is consistently more effective with smaller $\Delta$ values.
  • Figure 5: (Loan dataset) nDCG and Precision with increasing $p$. For both algorithms, $\Delta$ is higher (although MISFEAT performs better) due to the low correlation of features across subgroups.
  • ...and 4 more figures

Theorems & Definitions (16)

  • Example 1
  • Example 2
  • Definition 1
  • Definition 2
  • Definition 3: Subgroup feature lattice graph
  • Example 3
  • Example 4
  • Lemma 1
  • proof
  • Definition 4: Multiple lattice graph
  • ...and 6 more