What is the $\textit{intrinsic}$ dimension of your binary data? -- and how to compute it quickly

Tom Hanika, Tobias Hille

TL;DR

This work addresses the intrinsic dimensionality of binary data by adopting a formal-concept analysis (FCA) based geometric data-set framework to define and compute an intrinsic dimension (ID). It introduces a minimum-support approximation that yields computable lower and upper bounds $\Delta_{-}(\mathscr{D}_s)$ and $\Delta_{+}(\mathscr{D}_s)$ for the ID, and demonstrates feasibility on standard binary-data collections via concept mining and a two-pointer algorithm to accumulate the observable diameter. The findings show that ID captures aspects of data structure distinct from the normalized correlation dimension, providing informative bounds even when full concept enumeration is expensive; results vary across datasets and underscore the trade-off between bound tightness and computational effort. The approach offers a non-metric, FCA-based tool for binary data analysis with potential for broader applicability in dimension-aware data mining and FCA-driven analytics.

Abstract

Dimensionality is an important aspect of analyzing and understanding (high-dimensional) data. In their 2006 ICDM paper, Tatti et al. addressed the question of an (interpretable) dimension for binary data tables by introducing a normalized correlation dimension. In the present work, we revisit their results and contrast them with a concept-based notion of intrinsic dimension (ID) recently introduced for geometric data sets. To do this, we present a novel approximation of this ID that is based on computing concepts only up to a certain support value. We demonstrate and evaluate our approximation using all available datasets from Tatti et al., which have between 469 and 41271 extrinsic dimensions.
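
To make "computing concepts only up to a certain support value" concrete, here is a minimal Python sketch of formal concept enumeration with a minimum-support filter. It assumes that the support of a concept is the fraction of objects in its extent; the function name `frequent_concepts`, the row encoding, and the toy table are hypothetical. It uses a simple intersection-closure enumeration, which is only practical for small tables and is not the concept miner used by the authors.

```python
def frequent_concepts(rows, min_support):
    """Enumerate formal concepts (extent, intent) of a binary table.

    rows        -- list of attribute sets, one per object (hypothetical encoding)
    min_support -- keep concepts whose extent covers at least this
                   fraction of all objects (assumed notion of support)
    """
    n = len(rows)
    all_attrs = frozenset().union(*rows)

    # Every intent is an intersection of object rows; the empty
    # intersection is the full attribute set.
    intents = {all_attrs}
    for row in rows:
        row = frozenset(row)
        intents |= {intent & row for intent in intents}

    concepts = []
    for intent in intents:
        # The extent is the set of objects that have all attributes of the intent.
        extent = frozenset(i for i, row in enumerate(rows) if intent <= row)
        if len(extent) >= min_support * n:
            concepts.append((extent, intent))
    return concepts


# Toy table with four objects over the attributes a, b, c.
toy_rows = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
for extent, intent in sorted(frequent_concepts(toy_rows, 0.5),
                             key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

Raising `min_support` shrinks the set of concepts that has to be mined, which is the kind of saving the approximation relies on; the price is that the ID can then only be bounded rather than computed exactly.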

Paper Structure (19 sections, 1 theorem, 13 equations, 9 figures, 3 tables, 2 algorithms)

Key Result

Proposition

For $s\in[0,1]$, let $\alpha_{-1}\coloneqq\max\{\alpha\mid\text{ObsDiam}(\mathscr{D}_s(\mathbb{K});-\alpha)>0\}$. Then, for all $\alpha\in[0,1/2]$, we have $\text{ObsDiam}(\mathscr{D}(\mathbb{K});-\alpha) \leq\text{ObsDiam}(\mathscr{D}_s(\mathbb{K});-\alpha_{-1})$.
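
The observable diameter is not defined on this page. In the geometric data set framework it is usually taken as the largest partial diameter over all feature functions, where the partial diameter is the length of the shortest interval that carries at least mass $1-\alpha$. Under that assumption, the following Python sketch shows how a two-pointer sweep over sorted feature values yields the partial diameter. The names `partial_diameter` and `observable_diameter` and the toy values are hypothetical, and this is a generic sliding-window illustration rather than the authors' algorithm.

```python
def partial_diameter(values, weights, mass, eps=1e-12):
    """Length of the shortest interval whose points carry at least `mass` weight."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ws = [weights[i] for i in order]

    best = float("inf")
    window = 0.0   # total weight of xs[left:right]
    right = 0
    for left in range(len(xs)):
        # Advance the right pointer until the window is non-empty and heavy enough.
        while right < len(xs) and (right <= left or window + eps < mass):
            window += ws[right]
            right += 1
        if window + eps < mass:
            break  # no window starting here or further right can reach the mass
        best = min(best, xs[right - 1] - xs[left])
        window -= ws[left]  # drop the left point before moving on
    return best


def observable_diameter(features, weights, alpha):
    """Assumed definition: maximum partial diameter over all feature functions."""
    return max(partial_diameter(f, weights, 1 - alpha) for f in features)


# Toy example: two feature functions evaluated on four uniformly weighted points.
uniform = [0.25] * 4
features = [[0, 1, 2, 10], [0, 0, 0, 1]]
print(observable_diameter(features, uniform, 0.25))
```

Both pointers only move forward, so after sorting each feature function is processed in time linear in the number of points.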

Figures (9)

  • Figure 1: Exemplary visualization of the calculation steps for the geometric intrinsic dimension and the corresponding bounds for minimum support $1/2$. The symbols used for the bounds are introduced at the end of the section on the minimum-support approximation.
  • Figure 2: Results of the computation for the chess data set.
  • Figure 3: Results of the computation for the living beings in water data set.
  • Figure 4: Results of the computation for the mushroom data set.
  • Figure 5: Results of the computation for the accidents data set.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Proposition
  • Proof