Prototype Selection Using Topological Data Analysis

Jordan Eckert, Elvan Ceyhan, Henry Schenck

TL;DR

TPS addresses the challenge of reducing dataset size for classification while preserving performance by leveraging topological structure through persistent homology. The method constructs a two-parameter (bifiltration) topological representation to identify boundary-proximate, topology-rich prototypes, extracting vertex sets from carefully selected sub-complexes. Empirical results on nine simulated datasets and eight real-world datasets show TPS achieves substantial data reduction (roughly 60-85%) with maintained or improved G-Mean across classifiers, outperforming several baseline prototype selectors in many settings, though performance can depend on the metric. The work demonstrates the feasibility and practicality of topology-informed prototype selection, offering a scalable and interpretable alternative for model reduction and data summarization.
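The two quantities reported above, G-Mean and data reduction, can be illustrated with a minimal sketch. It assumes the common definition of G-Mean as the geometric mean of per-class recalls and defines reduction as the fraction of training points discarded; the function names `g_mean` and `reduction_rate` and the toy labels are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed definition of G-Mean)."""
    totals = Counter(y_true)  # number of true examples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    recalls = [hits[c] / totals[c] for c in totals]
    return math.prod(recalls) ** (1.0 / len(recalls))


def reduction_rate(n_original, n_prototypes):
    """Fraction of the original training set removed by prototype selection."""
    return 1.0 - n_prototypes / n_original


# Toy, made-up labels: class 0 recall is 3/4, class 1 recall is 1/2.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = g_mean(y_true, y_pred)    # sqrt(0.75 * 0.5)
rate = reduction_rate(1000, 250)  # keeping 250 of 1000 points

print(f"G-Mean = {score:.3f}, reduction = {rate:.0%}")
```

Because G-Mean multiplies per-class recalls, a classifier that ignores a minority class scores zero, which is why it is a natural metric when a reduced training set might drop rare-class points.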

Abstract

Recent statistical learning literature has seen rapid growth in methods that represent data using topological principles to capture structure and relationships. We propose a topological data analysis (TDA)-based framework, named Topological Prototype Selector (TPS), for selecting representative subsets (prototypes) from large datasets. We demonstrate the effectiveness of TPS on simulated data with varying intrinsic characteristics, and compare TPS against other currently used prototype selection methods on real data. In all simulated and real data settings, TPS preserves or improves classification performance while substantially reducing data size. These contributions advance both algorithmic and geometric aspects of prototype learning and offer practical tools for parallelized, interpretable, and efficient classification.

Paper Structure

This paper contains 19 sections, 5 equations, 44 figures, 6 tables, 3 algorithms.

Figures (44)

  • Figure 1: Geometric representations of $q$-simplices. Each is the $q$-dimensional analogue of a tetrahedron.
  • Figure 2: Comparison of the Čech complex (left) with the Rips complex (right) at the same radius. The Čech complex is a hollow triangle because the three balls have no common intersection, whereas the Rips complex is a solid triangle.
  • Figure 3: Example of barcode and persistent homology visualizations. Left: the original data. Middle: the corresponding persistence diagram, with (birth, death) pairs plotted. Right: the barcodes for specific filtration values.
  • Figure 4: An example of a commutative diagram for a bifiltration of $\mathcal{K}$ across two parameters. Each $\mathcal{K}_{i,j}$ represents a sub-complex.
  • Figure 5: Effect of neighbor quantile hyperparameter ($q$) selection on prototype count and G-Mean compared to baseline TPS.
  • ...and 39 more figures
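The contrast described in the Figure 2 caption can be reproduced numerically for three points. The sketch below is illustrative only: it assumes closed balls of radius $r$, with the Rips complex adding an edge when two centers are at most $2r$ apart and the Čech complex filling the triangle only when all three balls share a point (equivalently, when the smallest enclosing circle has radius at most $r$). The function names and the example triangle are hypothetical and not taken from the paper.

```python
import math
from itertools import combinations


def rips_has_triangle(points, r):
    """Rips: the triangle is filled as soon as every pair of balls overlaps."""
    return all(math.dist(a, b) <= 2 * r for a, b in combinations(points, 2))


def min_enclosing_radius(a, b, c):
    """Radius of the smallest circle containing three planar points."""
    s1, s2, s3 = sorted([math.dist(b, c), math.dist(a, c), math.dist(a, b)])
    if s1 ** 2 + s2 ** 2 <= s3 ** 2:
        # Right, obtuse, or degenerate triangle: longest side is a diameter.
        return s3 / 2
    # Acute triangle: circumradius R = abc / (4 * area), area via Heron.
    half = (s1 + s2 + s3) / 2
    area = math.sqrt(half * (half - s1) * (half - s2) * (half - s3))
    return s1 * s2 * s3 / (4 * area)


def cech_has_triangle(points, r):
    """Cech: the triangle is filled only if all three balls share a point."""
    a, b, c = points
    return min_enclosing_radius(a, b, c) <= r


# Equilateral triangle with side 1: pairwise overlap starts at r = 0.5,
# but a common intersection needs r >= 1 / sqrt(3), roughly 0.577.
pts = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
for r in (0.55, 0.60):
    print(r, "Rips solid:", rips_has_triangle(pts, r),
          "Cech solid:", cech_has_triangle(pts, r))
```

At $r = 0.55$ the Rips complex already contains the solid triangle while the Čech complex is still hollow; at $r = 0.60$ both are solid.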

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8