Selective inference after convex clustering with $\ell_1$ penalization

François Bachoc, Cathy Maugis-Rabusseau, Pierre Neuvial

TL;DR

This work develops selective inference after convex clustering with an $\ell_1$ penalty by deriving a polyhedral conditioning framework for Gaussian vectors, enabling valid post-clustering hypothesis testing. It first establishes a polyhedral characterization in the one-dimensional case and a corresponding test with conditional and unconditional guarantees, along with a regularization-path algorithm. It then extends the approach to the $p$-dimensional setting by aggregating one-dimensional clusterings, formulating a multi-dimensional testing procedure with rigorous guarantees under a matrix-normal model, and validating the method through numerical experiments. The methods are implemented in the R package poclin, and the results demonstrate proper type-I error control and competitive power, with practical relevance for post-clustering inference in applications like single-cell analysis.

Abstract

Classical inference methods notoriously fail when applied to data-driven test hypotheses or inference targets. Instead, dedicated methodologies are required to obtain statistical guarantees for these selective inference problems. Selective inference is particularly relevant post-clustering, typically when testing a difference in mean between two clusters. In this paper, we address convex clustering with $\ell_1$ penalization, by leveraging related selective inference tools for regression, based on Gaussian vectors conditioned on polyhedral sets. In the one-dimensional case, we prove a polyhedral characterization of obtaining given clusters, which enables us to suggest a test procedure with statistical guarantees. This characterization also allows us to provide a computationally efficient regularization path algorithm. Then, we extend the above test procedure and guarantees to multi-dimensional clustering with $\ell_1$ penalization, and also to more general multi-dimensional clusterings that aggregate one-dimensional ones. With various numerical experiments, we validate our statistical guarantees and we demonstrate the power of our methods to detect differences in mean between clusters. Our methods are implemented in the R package poclin.
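
To make the phrase "Gaussian vectors conditioned on polyhedral sets" concrete, the following is a minimal Python sketch of the generic polyhedral conditioning idea from selective inference for regression. It is not the poclin implementation and does not reproduce the paper's clustering-specific characterization. It assumes isotropic noise $\boldsymbol{X} \sim \mathcal{N}(\boldsymbol{\mu}, \sigma^2 I)$, a selection event of the form $\{A\boldsymbol{x} \le \boldsymbol{b}\}$, and a two-sided truncated-Gaussian $p$-value for $H_0: \boldsymbol{\eta}^\top \boldsymbol{\mu} = 0$. All function names and the toy constraint are hypothetical and chosen for illustration only.

```python
import math


def norm_cdf(u):
    """Standard normal CDF using only the standard library."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def truncation_interval(A, b, x, eta):
    """Return (t, v_minus, v_plus): the statistic t = eta^T x and the interval
    to which the event {A x <= b} restricts it, given the part of x that is
    independent of eta^T x. Assumes isotropic noise (hypothetical helper)."""
    eta_sq = dot(eta, eta)
    t = dot(eta, x)
    c = [e / eta_sq for e in eta]               # direction along which t moves x
    z = [xi - ci * t for xi, ci in zip(x, c)]   # component independent of t
    v_minus, v_plus = -math.inf, math.inf
    for row, bj in zip(A, b):
        ac = dot(row, c)
        resid = bj - dot(row, z)
        if ac < 0:
            v_minus = max(v_minus, resid / ac)  # constraint gives a lower bound
        elif ac > 0:
            v_plus = min(v_plus, resid / ac)    # constraint gives an upper bound
    return t, v_minus, v_plus


def selective_p_value(A, b, x, eta, sigma=1.0):
    """Two-sided p-value for H0: eta^T mu = 0 under the truncated Gaussian law.
    Can be numerically unstable when the truncation interval lies far in a tail."""
    t, lo, hi = truncation_interval(A, b, x, eta)
    s = sigma * math.sqrt(dot(eta, eta))        # standard deviation of eta^T X
    den = norm_cdf(hi / s) - norm_cdf(lo / s)
    cdf_at_t = (norm_cdf(t / s) - norm_cdf(lo / s)) / den
    return 2.0 * min(cdf_at_t, 1.0 - cdf_at_t)


# Toy example (assumption): two observations, selection event x_1 <= x_2,
# and a contrast comparing the two values.
x = [2.0, 6.0]
eta = [1.0, -1.0]
p_selective = selective_p_value([[1.0, -1.0]], [0.0], x, eta)
p_naive = selective_p_value([], [], x, eta)     # no conditioning: classical z-test
```

In this toy example the conditional $p$-value is larger than the unconditional one, which reflects the correction for having chosen the hypothesis after looking at the data.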

Paper Structure (38 sections, 89 equations, 11 figures, 1 table, 1 algorithm)

Figures (11)

  • Figure 1: Illustration of Definition \ref{def:clust-seg} for a clustering with $K=3$ clusters of the observed values $\boldsymbol{x}=(2,6,11,10,7,1,6.5,7)$.
  • Figure 2: Regularization path (see Section \ref{subsection:regularization:path:one-dimensional}) associated with the convex clustering problem for the observed values $\boldsymbol{x}=(2,6,11,10,7,1,6.5,7)$.
  • Figure 3: Left: empirical density of $\boldsymbol{\eta}^\top \boldsymbol{\mu}$ for each $\nu$. Right: empirical cumulative distribution functions of the $p$-value of the test of equality between the means of two clusters.
  • Figure 4: The empirical cumulative distribution function of the $p$-values across $500$ experiments for $n=100$ with our method poclin (in green) and the Student test (in orange). Each column corresponds to a variable $j$, each row to a value of $\nu$ and each line type to a value of $\rho$.
  • Figure 5: The empirical cumulative distribution function of the $p$-values across $500$ experiments for $n=1000$ with our method poclin (in green) and the Student test (in orange). Each column corresponds to a variable $j$, each row to a value of $\nu$ and each line type to a value of $\rho$.
  • ...and 6 more figures

Theorems & Definitions (14)

  • Proof of Lemma \ref{lm:increasing-hat-B:least:square}
  • Proof of Lemma \ref{lem:linearization:convex}
  • Proof of Lemma \ref{lemma:condition:obs:absolute:value}
  • Proof of Lemma \ref{lm:increasing-hat-B}
  • Proof of Theorem \ref{theorem:equivalence:polyhedral:one:d}
  • Proof and full expressions for Lemma \ref{corPolyh}
  • Proof of Proposition \ref{prop:polyhedral:lemma}
  • Proof of Lemma \ref{lemma:expression:pvalue}
  • Proof of Proposition \ref{prop:level:conditional}
  • Proof of Proposition \ref{proposition:unconditional:level}
  • ...and 4 more