The generalized underlap coefficient with an application in clustering

Zhaoxi Zhang; Vanda Inacio; Sara Wade

TL;DR

The underlap coefficient (UNL), a multi-group separation measure, is generalized to multivariate variables; its key properties are established, along with an explicit connection to the total variation.

Abstract

Quantifying distributional separation across groups is fundamental in statistical learning and scientific discovery, yet most classical discrepancy measures are tailored to two-group comparisons. We generalize the underlap coefficient (UNL), a multi-group separation measure, to multivariate variables. We establish key properties of the UNL and provide an explicit connection to the total variation. We further interpret the UNL as a dependence measure between a group label and variables of interest and compare it with mutual information. We propose an importance sampling estimator of the UNL that can be combined with flexible density estimators. We highlight in detail the utility of the UNL for assessing partition-covariate dependence in clustering, where it is particularly useful for evaluating the single-weights assumption in covariate-dependent mixture models. Finally, we illustrate the application of the UNL in clustering using two real-world datasets.
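The importance sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in this summary: that the UNL is the integral of the pointwise maximum of the group densities, that the group densities are known normal densities (the paper instead plugs in flexible density estimates), and that the proposal is the equal-weight mixture of the group densities. The function name `unl_importance_sampling` is hypothetical and this is not necessarily the authors' estimator.

```python
import random
from statistics import NormalDist


def unl_importance_sampling(groups, n_draws=20000, seed=0):
    """Monte Carlo estimate of the integral of max_k f_k(x) dx.

    Assumptions (not taken from the paper): the proposal q is the
    equal-weight mixture of the group densities, and each f_k is a
    known NormalDist rather than an estimated density.
    """
    rng = random.Random(seed)
    k = len(groups)
    total = 0.0
    for _ in range(n_draws):
        # Draw from the mixture proposal: pick a component, then sample it.
        g = groups[rng.randrange(k)]
        x = rng.gauss(g.mean, g.stdev)
        dens = [f.pdf(x) for f in groups]
        # Importance weight: target integrand over proposal density.
        total += max(dens) / (sum(dens) / k)
    return total / n_draws


# Three-class Gaussian setting with means -D, 0, D and unit variance.
D = 2.0
groups = [NormalDist(-D, 1.0), NormalDist(0.0, 1.0), NormalDist(D, 1.0)]
estimate = unl_importance_sampling(groups)
# Under the assumed definition, this setting has the closed form 4 * Phi(D / 2) - 1.
closed_form = 4.0 * NormalDist().cdf(D / 2.0) - 1.0
print(estimate, closed_form)
```

With this proposal each importance weight lies between 1 and the number of groups, so the estimate stays in that range and its variance is bounded, which is one reason a mixture proposal is a natural choice for a sketch like this.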

Paper Structure (36 sections, 2 theorems, 92 equations, 24 figures, 1 algorithm)

Introduction
The generalized underlap coefficient
Properties of the underlap coefficient
Connecting the underlap coefficient with total variation
Comparing the underlap coefficient with mutual information
Illustrative examples comparing UNL with MI.
Estimation of UNL by importance sampling
UNL as a tool for discovering covariate dependence in cluster analysis
Clustering based only on the response
The marginal approach of mixture models.
Detecting dependence of the partition on covariates in the marginal approach.
Incorporating covariate information in clustering
The conditional approach of mixture models.
Detecting dependence of the partition on covariates in the conditional approach.
Real data illustrations
...and 21 more sections

Key Result

Proposition 1

Suppose $P_1$ and $P_2$ are probability measures absolutely continuous with respect to $\nu$, with Radon-Nikodym derivatives $f_1$ and $f_2$. Then $\text{UNL}(f_1,f_2)=1+\text{TV}(P_1,P_2)$.
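A quick numerical check of this identity can be sketched as follows. It assumes, since this summary does not spell out the definitions, that the UNL is the integral of $\max(f_1, f_2)$ and that $\text{TV}(P_1,P_2)=\tfrac{1}{2}\int |f_1-f_2|\,d\nu$. Under those assumptions the identity follows from $\max(a,b)=\tfrac{1}{2}(a+b)+\tfrac{1}{2}|a-b|$. The helper `unl_and_tv` and the choice of two unit-variance normal densities are illustrative only.

```python
from statistics import NormalDist


def unl_and_tv(p1, p2, lo=-12.0, hi=12.0, n=20001):
    """Trapezoidal approximations on [lo, hi] of
    UNL = integral of max(f1, f2) and TV = 0.5 * integral of |f1 - f2|.
    Both definitions are assumptions of this sketch.
    """
    h = (hi - lo) / (n - 1)
    unl = 0.0
    tv = 0.0
    for i in range(n):
        x = lo + i * h
        w = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        a, b = p1.pdf(x), p2.pdf(x)
        unl += w * max(a, b) * h
        tv += w * 0.5 * abs(a - b) * h
    return unl, tv


p1, p2 = NormalDist(-1.0, 1.0), NormalDist(1.0, 1.0)
unl, tv = unl_and_tv(p1, p2)
# For equal-variance normals, TV = 2 * Phi(|mu1 - mu2| / (2 * sigma)) - 1.
closed_form_tv = 2.0 * NormalDist().cdf(1.0) - 1.0
print(unl, 1.0 + tv, 1.0 + closed_form_tv)
```

The three printed values should agree up to the discretization error of the grid.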

Figures (24)

Figure 1: Curves of UNL and $\mathrm{MI}_Z$ in the three-class Gaussian example, where $Y \mid Z=k \sim \mathrm{N}(\mu_k,1)$ and $\mu_1=-D, \mu_2=0, \mu_3=D$ (top row), and where $Y \mid Z=k \sim \mathrm{N}(\mu_k,1)$ and $\mu_1=-0.1, \mu_2=0, \mu_3=D$ (bottom row).
Figure 2: Top row: Example A. Bottom row: Example B. Left: the representative partition inferred from the DPM. Right: histograms of the estimated UNL of the covariates.
Figure 3: Top row: Example C1. Bottom row: Example C2. Left: the representative partition inferred from the DPM. Right: the histograms of the estimated UNL of the covariates.
Figure 4: Heatmap of the true and estimated density regression functions of Examples C1 and C2 conditioned on $x^d=1$. Top row: Example C1. Bottom row: Example C2.
Figure 5: Left: the representative partition of Example D inferred from the LDDP model. Right: the histograms of the estimated UNL of the covariates of Example D.
...and 19 more figures

Theorems & Definitions (16)

Definition 1: UNL for continuous variables
Definition 2: UNL for discrete variables
Definition 3: UNL for mixed type variables
Definition 4: Measure theoretic formulation of UNL
Proposition 1: UNL's relationship with total variation distance when $K=2$
Proposition 2: UNL equals the total variation norm of a vector-valued measure consisting of $K$ probability measures
proof
proof
proof
proof
...and 6 more
