Estimating Conditional Mutual Information for Dynamic Feature Selection

Soham Gadgil, Ian Covert, Su-In Lee

TL;DR

This work tackles dynamic feature selection by reframing CMI-based feature acquisition as a discriminative estimation problem. It introduces DIME, which jointly trains a predictor and a per-feature CMI estimator to recover $I({\mathbf{y}}; {\mathbf{x}}_i \mid x_S)$ without generative models, and it extends the framework to handle prior information, non-uniform feature costs, and variable budgets. The authors prove that, at optimality, the value network recovers the CMI (or the corresponding loss-reduction quantity) and demonstrate consistent gains over state-of-the-art methods across tabular and image datasets, including ViT-based architectures that better handle partial inputs. The approach yields flexible stopping criteria (budget, confidence, or penalized), enabling improved cost-accuracy tradeoffs with practical implications for cost-sensitive deployment in domains like medical diagnosis and histopathology.
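The selection policy and the three stopping criteria named above can be sketched as a greedy loop. This is a minimal illustration, not the authors' implementation: `predict` and `estimate_cmi` are hypothetical stand-ins for the trained predictor and value network, ranking by estimated CMI per unit cost is an assumed way to handle non-uniform costs, and the thresholds `max_entropy` and `penalty` are illustrative parameter names.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Set


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_features(
    predict: Callable[[Set[int]], Sequence[float]],      # hypothetical predictor
    estimate_cmi: Callable[[Set[int]], Dict[int, float]],  # hypothetical value network
    costs: Dict[int, float],
    budget: Optional[float] = None,       # budget criterion (total cost)
    max_entropy: Optional[float] = None,  # confidence criterion
    penalty: Optional[float] = None,      # penalized criterion (min CMI per cost)
) -> List[int]:
    """Greedily acquire features until one of the active criteria fires."""
    selected: List[int] = []
    spent = 0.0
    while len(selected) < len(costs):
        observed = set(selected)
        # Confidence: stop once the current prediction is certain enough.
        if max_entropy is not None and entropy(predict(observed)) <= max_entropy:
            break
        scores = estimate_cmi(observed)
        # Assumption: rank remaining features by estimated CMI per unit cost.
        ratios = {i: scores[i] / costs[i] for i in costs if i not in observed}
        best = max(ratios, key=ratios.get)
        # Penalized: stop when the best remaining feature is not worth its cost.
        if penalty is not None and ratios[best] < penalty:
            break
        # Budget: stop when the best feature no longer fits (a simplification;
        # one could instead fall back to a cheaper feature).
        if budget is not None and spent + costs[best] > budget:
            break
        selected.append(best)
        spent += costs[best]
    return selected


# Toy stand-ins, purely illustrative.
toy_cmi = {0: 0.50, 1: 0.30, 2: 0.02}
toy_costs = {0: 1.0, 1: 1.0, 2: 1.0}


def toy_estimate_cmi(observed: Set[int]) -> Dict[int, float]:
    return {i: v for i, v in toy_cmi.items() if i not in observed}


def toy_predict(observed: Set[int]) -> List[float]:
    conf = min(0.99, 0.5 + 0.2 * len(observed))
    return [conf, 1.0 - conf]


print(select_features(toy_predict, toy_estimate_cmi, toy_costs, budget=2.0))       # [0, 1]
print(select_features(toy_predict, toy_estimate_cmi, toy_costs, penalty=0.1))      # [0, 1]
print(select_features(toy_predict, toy_estimate_cmi, toy_costs, max_entropy=0.65))  # [0]
```

Because the criteria are checked inside one loop, they can be combined, for example a hard budget together with a confidence threshold, which is one way to obtain a different number of acquired features per sample.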

Abstract

Dynamic feature selection, where we sequentially query features to make accurate predictions with a minimal budget, is a promising paradigm to reduce feature acquisition costs and provide transparency into a model's predictions. The problem is challenging, however, as it requires both predicting with arbitrary feature sets and learning a policy to identify valuable selections. Here, we take an information-theoretic perspective and prioritize features based on their mutual information with the response variable. The main challenge is implementing this policy, and we design a new approach that estimates the mutual information in a discriminative rather than generative fashion. Building on our approach, we then introduce several further improvements: allowing variable feature budgets across samples, enabling non-uniform feature costs, incorporating prior information, and exploring modern architectures to handle partial inputs. Our experiments show that our method provides consistent gains over recent methods across a variety of datasets.

Paper Structure

This paper contains 22 sections, 12 theorems, 32 equations, 19 figures, 2 tables, and 2 algorithms.

Key Result

Lemma 1

When we use the Bayes classifier $p({\mathbf{y}} \mid {\mathbf{x}}_S)$ as a predictor and $\ell$ is cross entropy loss, the incremental loss improvement is an unbiased estimator of the CMI for each $(x_S, {\mathbf{x}}_i)$ pair:
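The displayed equation of the lemma is not captured in this summary, and the sketch below does not attempt to reproduce it. It only checks numerically the standard identity that the statement appears to rest on: with the Bayes classifier and cross-entropy loss, the expected loss before observing $x_i$ is $H(y \mid x_S)$ and after is $H(y \mid x_S, x_i)$, so the expected loss improvement equals $I(y; x_i \mid x_S)$. The joint distribution here is a hypothetical toy table over three binary variables, and all function names are illustrative.

```python
import math

# Hypothetical toy joint distribution p(x_s, x_i, y); the entries sum to 1.
joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05,
    (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.10, (1, 0, 1): 0.15,
    (1, 1, 0): 0.05, (1, 1, 1): 0.20,
}


def prob(xs=None, xi=None, y=None):
    """Marginal probability of the given values (None means summed out)."""
    return sum(
        p for (a, b, c), p in joint.items()
        if (xs is None or a == xs)
        and (xi is None or b == xi)
        and (y is None or c == y)
    )


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def expected_loss_improvement(xs):
    """E over (x_i, y) given x_s of [CE before x_i] - [CE after x_i], Bayes classifier."""
    total = 0.0
    for xi in (0, 1):
        for y in (0, 1):
            p_joint = prob(xs, xi, y)
            if p_joint == 0.0:
                continue
            loss_before = -math.log(prob(xs=xs, y=y) / prob(xs=xs))  # -log p(y | x_s)
            loss_after = -math.log(p_joint / prob(xs=xs, xi=xi))      # -log p(y | x_s, x_i)
            total += (p_joint / prob(xs=xs)) * (loss_before - loss_after)
    return total


def cmi_from_entropies(xs):
    """I(y; x_i | x_s) computed as H(y | x_s) - H(y | x_s, x_i)."""
    h_before = entropy([prob(xs=xs, y=y) / prob(xs=xs) for y in (0, 1)])
    h_after = sum(
        (prob(xs=xs, xi=xi) / prob(xs=xs))
        * entropy([prob(xs, xi, y) / prob(xs=xs, xi=xi) for y in (0, 1)])
        for xi in (0, 1)
    )
    return h_before - h_after


for xs in (0, 1):
    print(xs, expected_loss_improvement(xs), cmi_from_entropies(xs))
```

The two columns agree for each value of $x_S$. A single sampled loss improvement can be negative or far from the CMI; only its average over $(x_i, y)$ matches, which is the sense in which it is an unbiased estimator.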

Figures (19)

  • Figure 1: Diagram of our training approach. At each selection step $n$, the value network $v(x_S ; \phi)$ predicts the CMI for all features, and a single feature $x_i$ is chosen for the next prediction $f(x_{S \cup i} ; \theta)$. The prediction loss is used to update the predictor (see the paper's predictor objective), and the loss improvement is used to update the value network (see the paper's value-network objective). The networks are trained jointly with SGD.
  • Figure 2: Evaluation with tabular datasets for varying feature acquisition budgets. Results are averaged across 5 trials, and shaded regions indicate the standard error for each method.
  • Figure 3: Evaluation with non-uniform feature costs for medical diagnosis tasks. Costs are relative for Intubation and expressed in seconds for ROSMAP. The results show the classification performance for varying levels of average feature acquisition cost.
  • Figure 4: Evaluation of DIME on image datasets with different vision architectures.
  • Figure 5: Evaluation with image datasets for varying numbers of average patches selected. Results are averaged across 5 trials, and shaded regions indicate the standard error for each method.
  • ...and 14 more figures
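The training signal described in the Figure 1 caption can be sketched for a single selection step as follows. The objectives the caption refers to are not included in this summary, so the squared-error regression of the value network's output onto the observed loss improvement is an assumption, chosen because the minimizer of a squared error is a conditional mean, which is consistent with the unbiasedness statement in Lemma 1. The function names are hypothetical, and a real implementation would compute these quantities on batches and backpropagate through both networks.

```python
import math
from typing import Sequence, Tuple


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross-entropy loss of a predicted distribution for one true label."""
    return -math.log(probs[label])


def step_losses(
    probs_before: Sequence[float],  # f(x_S; theta), prediction before acquiring x_i
    probs_after: Sequence[float],   # f(x_{S ∪ i}; theta), prediction after acquiring x_i
    label: int,
    predicted_cmi: float,           # v_i(x_S; phi), value network output for feature i
) -> Tuple[float, float, float]:
    """Return (predictor loss, value-network loss, loss improvement) for one step."""
    loss_before = cross_entropy(probs_before, label)
    loss_after = cross_entropy(probs_after, label)
    improvement = loss_before - loss_after
    predictor_loss = loss_after                        # updates the predictor
    value_loss = (predicted_cmi - improvement) ** 2    # assumed squared-error target
    return predictor_loss, value_loss, improvement


pred_loss, val_loss, delta = step_losses([0.5, 0.5], [0.9, 0.1], label=0, predicted_cmi=0.4)
print(pred_loss, val_loss, delta)
```

In this toy step the acquired feature raises the probability of the true class from 0.5 to 0.9, so the loss improvement is positive and the value network is pushed toward that larger value.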

Theorems & Definitions (19)

  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Proposition 1
  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Lemma 2
  • proof
  • ...and 9 more