Variable Selection in Maximum Mean Discrepancy for Interpretable Distribution Comparison

Kensuke Mitsuzawa, Motonobu Kanagawa, Stefano Bortoli, Margherita Grossi, Paolo Papotti

TL;DR

This work formally defines the discriminating set of variables, the set that captures all distributional differences between two datasets, and proves that it is unique, which provides a principled ground truth for two-sample variable selection. It then contributes two ARD-based, sparsity-promoting methods that maximise MMD test power while downweighting redundant variables, along with two data-driven strategies to select the regularisation parameter and to aggregate results across candidates. The methods are validated on synthetic data and demonstrated on realistic applications involving water-pipe leakage and traffic network perturbations, showing improved precision, recall, and stability, especially with aggregation. The study advances interpretable distribution comparison by coupling rigorous theory with practical, scalable algorithms for identifying discriminating variables in high-dimensional settings.

Abstract

We study two-sample variable selection: identifying variables that discriminate between the distributions of two sets of data vectors. Such variables help scientists understand the mechanisms behind dataset discrepancies. Although domain-specific methods exist (e.g., in medical imaging, genetics, and computational social science), a general framework remains underdeveloped. We make two separate contributions. (i) We introduce a mathematical notion of the discriminating set of variables: the largest subset containing no variables whose marginals are identical across the two distributions and independent of the remaining variables. We prove this set is uniquely defined and establish further properties, making it a suitable ground truth for theory and evaluation. (ii) We propose two methods for two-sample variable selection that assign weights to variables and optimise them to maximise the power of a kernel two-sample test while enforcing sparsity to downweight redundant variables. To select the regularisation parameter - unknown in practice, as it controls the number of selected variables - we develop two data-driven procedures to balance recall and precision. Synthetic experiments show improved performance over baselines, and we illustrate the approach on two applications using datasets from water-pipe and traffic networks.
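
The weighting idea in contribution (ii) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a Gaussian kernel with a fixed bandwidth, the plain unbiased MMD$^2$ estimate as a stand-in for the test-power criterion, and an L1 penalty as the sparsity term. The function names, the `bandwidth` default, and the parameter `lam` are hypothetical.

```python
import numpy as np


def ard_gaussian_kernel(X, Y, weights, bandwidth=1.0):
    """Gaussian kernel after rescaling each variable by its ARD weight."""
    Xw = X * weights  # broadcasts (n, D) * (D,)
    Yw = Y * weights
    sq_dists = (
        np.sum(Xw**2, axis=1)[:, None]
        + np.sum(Yw**2, axis=1)[None, :]
        - 2.0 * Xw @ Yw.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_unbiased(X, Y, weights, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples X and Y."""
    n, m = len(X), len(Y)
    Kxx = ard_gaussian_kernel(X, X, weights, bandwidth)
    Kyy = ard_gaussian_kernel(Y, Y, weights, bandwidth)
    Kxy = ard_gaussian_kernel(X, Y, weights, bandwidth)
    # Drop the diagonal terms k(x_i, x_i) for the unbiased estimate.
    np.fill_diagonal(Kxx, 0.0)
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()


def penalised_objective(X, Y, weights, lam):
    """Objective to minimise: negative MMD^2 plus an L1 sparsity penalty.

    Assumption: MMD^2 is used here as a simplified proxy for test power.
    """
    return -mmd2_unbiased(X, Y, weights) + lam * np.sum(np.abs(weights))
```

A weight vector that concentrates on variables whose distributions differ should yield a lower objective than one that concentrates on redundant variables; in practice the weights would be optimised with a gradient-based method rather than compared by hand.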

Paper Structure

This paper contains 40 sections, 4 theorems, 38 equations, 26 figures, 3 algorithms.

Key Result

Proposition 1

For any probability distributions $P$ and $Q$ on $\mathbb{R}^D$, a discriminating subset $S \subset \{1, \dots, D\}$ satisfying Definition 1 exists uniquely. Moreover, let $U \subset \{1, \dots, D\}$ be the largest subset of variables on which $P$ and $Q$ have identical marginals and which is independent of the remaining variables. Then the discriminating set is given by its complement, $S = \{1, \dots, D\} \setminus U$, and we obtain the decomposition
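
As a hypothetical illustration of the sets $S$ and $U$ (an assumed example, not one taken from the paper), let $D = 3$ and

$$P = \mathcal{N}(0, I_3), \qquad Q = \mathcal{N}\big((1, 0, 0)^\top, I_3\big).$$

Variables 2 and 3 have the same marginal $\mathcal{N}(0, I_2)$ under both distributions and are independent of variable 1 under both, so $U = \{2, 3\}$ and $S = \{1\}$. If instead variable 2 were correlated with variable 1 under $Q$ while keeping its $\mathcal{N}(0, 1)$ marginal, it would no longer be independent of the remaining variables, giving $U = \{3\}$ and $S = \{1, 2\}$.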

Figures (26)

  • Figure 1: Illustration of discriminating variables (here, pixels, shown as yellow dots) selected by one of the proposed methods (Algorithm \ref{alg-enhanced-stability-selection}), applied to two sets of face images of cats and dogs. Two example images from each set are shown. See Appendix \ref{sec:cats-dogs-demo} for details.
  • Figure 2: Optimised ARD weights without regularisation (Left, Sutherland2016) and with regularisation (Right, Algorithm \ref{alg-hyperparameter_selection}). Here, $S = \{1, 4\}$ are the true discriminating variables for distinguishing $P$ and $Q$, and $U = \{0, 2, 3, 5, \dots, 19\}$ are the redundant variables with zero-variance marginal distributions. Without regularisation, the redundant variables' ARD weights do not change from their initial value of $1$, burying the weights of the discriminating variables. With regularisation, these redundant variables are successfully eliminated. The details of the setting are described in Section \ref{sec:synthetic_data_assessment} ("Redundant Dirac").
  • Figure 3: Synthetic data experiment results (Section \ref{sec:synthetic_data_assessment}). The top, middle, and bottom panels report F score, precision, and recall, respectively. Groups correspond to the six settings. Bars show mean $\pm$ standard deviation over 10 runs.
  • Figure 4: F scores for different sample sizes in the Laplace distribution setting in Section \ref{sec:synthetic_data_assessment}. The horizontal axis indicates sample sizes. For each sample size and each method, the interval shows the standard deviation of the F scores over 10 experiments.
  • Figure 5: L-Town network from the BattLeDIM 2020 dataset, showing installed sensors (squares). AMR and Pressure sensors are mounted at nodes; some nodes host both types. Red triangles mark pipe segments with programmed leaks. Leaks occur at different times.
  • ...and 21 more figures
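
The precision, recall, and F score reported in the figures above compare a selected variable set against the ground-truth discriminating set. A minimal sketch of that comparison follows, assuming the F score is the usual F1 (the harmonic mean of precision and recall); the function name `selection_scores` is hypothetical.

```python
def selection_scores(selected, truth):
    """Return (precision, recall, F1) of a selected variable set.

    `selected` and `truth` are iterables of variable indices.
    Assumption: empty sets yield a score of 0.0 rather than an error.
    """
    selected, truth = set(selected), set(truth)
    true_positives = len(selected & truth)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, with the ground truth $S = \{1, 4\}$ from the Figure 2 setting, selecting variables 1, 4, and 7 recovers both discriminating variables (recall 1) at the cost of one false positive (precision 2/3).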

Theorems & Definitions (19)

  • Definition 1: Discriminating Set of Variables
  • Proposition 1
  • proof
  • Corollary 1
  • proof
  • Example 1
  • Example 2
  • Remark 1
  • Proposition 2
  • proof
  • ...and 9 more