Model Selection over Partially Ordered Sets

Armeen Taeb, Peter Bühlmann, Venkat Chandrasekaran

TL;DR

The paper presents a general poset-based framework for model selection that extends beyond Boolean structures by endowing model collections with a least element and a rank that captures complexity. It defines true discoveries via a symmetric similarity valuation $\rho$, enabling generalized false discovery metrics $\mathrm{TD}$, $\mathrm{FD}$, and $\mathrm{FDP}$, and introduces two generic FD-control procedures: a stability-based approach and a testing-based approach. The methods apply across diverse domains, including variable selection, clustering, ranking, causal structure learning, changepoint estimation, and blind source separation, with theoretical guarantees on FD control and practical algorithms. Empirical results on synthetic and real data demonstrate controlled false discoveries along with meaningful discoveries, and the authors provide open-source code for implementation.
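
To make the quantities in the summary above concrete, here is a minimal sketch for the simplest poset, variable selection. Models are subsets of variables ordered by inclusion, the empty set is the least element, and the rank is the subset size. The specific choices `rho(x, y) = |x ∩ y|`, `FD = rank - TD`, and `FDP = FD / max(rank, 1)` are assumptions made for illustration and are not spelled out in this summary. The variable names are hypothetical.

```python
from fractions import Fraction

# Assumed setup: a model is a frozenset of selected variables, ordered by
# inclusion. The empty set plays the role of the least element.

def rank(model: frozenset) -> int:
    """Complexity of a model; here simply the number of selected variables."""
    return len(model)

def rho(x: frozenset, y: frozenset) -> int:
    """Assumed similarity valuation: size of the overlap (symmetric in x, y)."""
    return len(x & y)

def discoveries(estimate: frozenset, truth: frozenset):
    """Return (TD, FD, FDP) under the assumed definitions described above."""
    td = rho(estimate, truth)              # true discoveries
    fd = rank(estimate) - td               # false discoveries
    fdp = Fraction(fd, max(rank(estimate), 1))  # proportion; 0 for the empty model
    return td, fd, fdp

# Hypothetical example: three true variables, one wrongly selected.
truth = frozenset({"x1", "x2", "x3"})
estimate = frozenset({"x2", "x3", "x7"})
td, fd, fdp = discoveries(estimate, truth)
print(td, fd, fdp)
```

In this Boolean case the quantities reduce to the familiar counts of correctly and incorrectly included variables. The point of the poset framework is that the same three functions can be swapped out for non-Boolean model classes such as rankings or clusterings.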

Abstract

In problems such as variable selection and graph estimation, models are characterized by Boolean logical structure such as presence or absence of a variable or an edge. Consequently, false positive error or false negative error can be specified as the number of variables/edges that are incorrectly included or excluded in an estimated model. However, there are several other problems such as ranking, clustering, and causal inference in which the associated model classes do not admit transparent notions of false positive and false negative errors due to the lack of an underlying Boolean logical structure. In this paper, we present a generic approach to endow a collection of models with partial order structure, which leads to a hierarchical organization of model classes as well as natural analogs of false positive and false negative errors. We describe model selection procedures that provide false positive error control in our general setting and we illustrate their utility with numerical experiments.

Paper Structure (42 sections, 15 theorems, 71 equations, 4 figures, 2 tables, 1 algorithm)

Key Result

Theorem 10

Let $(\mathcal{L}, \preceq, \mathrm{rank}(\cdot))$ be a graded discrete model poset with integer-valued similarity valuation $\rho$ and let $\mathcal{S}$ be an associated set of minimal covering pairs. Let $\hat{x}_{\mathrm{base}}$ be a base estimator. Suppose the dataset $\mathcal{D}$ employed in t... Here the set $\mathcal{T}_\text{null} := \{(u,v) \text{ covering pair in } \mathcal{L} \mid \rho(v, x^{\ldots}$ ...

Figures (4)

  • Figure 1: Hasse diagrams for a) variable selection with $3$ variables (Example 1); b) clustering $4$ variables (Example 2); c) multisample testing with $4$ samples (Example 3); d) causal inference with $3$ variables (Example 4); e) partial ranking of $3$ items (Example 6); and f) total ranking of $3$ items (Example 7).
  • Figure 2: Comparing the performance of Algorithm 1 with $\Psi = \Psi_\text{stable}$ versus a non-subsampling approach for total ranking, clustering, and causal structure learning. Each problem setting corresponds to a pair of dots and a connecting line. The comparison is in terms of the number of false and true discoveries.
  • Figure 3: Left: CPDAG obtained by Algorithm 1 with $\Psi = \Psi_\text{stable}$. Right: comparison of the edges obtained by our algorithm (shown in the leftmost column) with those of different causal discovery methods (with indicated reference). The consensus network according to sachs is denoted here by "sachsa" and their reconstructed network by "sachsb"; the authors in nicolai_pnas apply two methods, and the results are presented by "nicolai_pnasa" and "nicolai_pnasb". Here, "$-$" means that the edge direction is not identified.
  • Figure 4: Four CPDAGs. Here, CPDAGs $\mathcal{C}_3$ and $\mathcal{C}_4$ are both largest-complexity models that are smaller (in a partial order sense) than $\mathcal{C}_1$ and $\mathcal{C}_2$. Similarly, CPDAGs $\mathcal{C}_1$ and $\mathcal{C}_2$ are both smallest-complexity models that are larger (in a partial order sense) than $\mathcal{C}_3$ and $\mathcal{C}_4$.
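
As a rough illustration of the Hasse diagram in Figure 1(a) and of the covering pairs that appear in Theorem 10, the sketch below enumerates the variable-selection poset on $3$ variables. It assumes the standard Boolean-lattice convention that $v$ covers $u$ when $u \subset v$ and $v$ has exactly one more variable. The helper names `subsets` and `covering_pairs` are hypothetical and not taken from the authors' code.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items` as frozensets, listed by increasing size (rank)."""
    items = sorted(items)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def covering_pairs(models):
    """Pairs (u, v) with u strictly below v and nothing in between.

    In the subset lattice this means u is a proper subset of v and the
    two differ by exactly one element; each pair is one Hasse-diagram edge.
    """
    return [(u, v) for u in models for v in models
            if u < v and len(v) - len(u) == 1]

models = subsets({1, 2, 3})
pairs = covering_pairs(models)
least = min(models, key=len)  # the empty model, i.e. the least element

print(len(models), "models,", len(pairs), "covering pairs")
for u, v in pairs:
    print(sorted(u), "->", sorted(v))
```

Each printed pair corresponds to adding a single variable, which is the elementary "discovery" step in the Boolean case. For the other posets in Figure 1 (clusterings, rankings, CPDAGs), the covering relation has to be defined separately.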

Theorems & Definitions (41)

  • Example 1: Variable selection
  • Example 2: Clustering
  • Example 3: Multisample testing
  • Example 4: Causal structure learning
  • Example 5: Multiple changepoint estimation
  • Example 6: Partial ranking
  • Example 7: Total ranking
  • Example 8: Subspace estimation
  • Example 9: Blind source separation
  • Definition 1: similarity valuation
  • ...and 31 more