Intrinsic Dimensionality as a Model-Free Measure of Class Imbalance

Çağrı Eser, Zeynep Sonat Baltacı, Emre Akbaş, Sinan Kalkan

TL;DR

The paper tackles class imbalance by introducing data Intrinsic Dimensionality (ID), a model-free, training-free measure estimated per class with FisherS that yields normalized per-class scores $\hat{d}_c$. ID captures intrinsic data complexity beyond cardinality and redundancy, and its estimates are robust to sample size, extrinsic dimension, and noise. The authors show how to integrate $\hat{d}_c$ into resampling, loss reweighting, and margin-based methods, and they demonstrate substantial gains across CIFAR-LT, Places-LT, ImageNet-LT, and semantic-imbalance datasets like SVCI-20. Across five datasets, ID-based mitigation outperforms cardinality-based baselines and is competitive with state-of-the-art approaches, with the added advantages of being model-free and scalable.

Abstract

Imbalance in classification tasks is commonly quantified by the cardinalities of examples across classes. This, however, disregards the presence of redundant examples and inherent differences in the learning difficulties of classes. Alternatively, one can use complex measures such as training loss and uncertainty, which, however, depend on training a machine learning model. Our paper proposes using data Intrinsic Dimensionality (ID) as an easy-to-compute, model-free measure of imbalance that can be seamlessly incorporated into various imbalance mitigation methods. Our results across five different datasets with a diverse range of imbalance ratios show that ID consistently outperforms cardinality-based re-weighting and re-sampling techniques used in the literature. Moreover, we show that combining ID with cardinality can further improve performance. Code: https://github.com/cagries/IDIM.
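As a rough illustration of how normalized per-class scores might plug into a resampling scheme, the sketch below turns per-class ID estimates into per-example sampling probabilities. This is a minimal sketch under stated assumptions, not the paper's formulation: the raw ID values are made up (the paper estimates them with FisherS), the sum-to-one normalization is one plausible choice, and the function names `normalize_scores` and `sampling_probabilities` are hypothetical.

```python
import numpy as np

def normalize_scores(raw_ids):
    """Scale raw per-class ID estimates so they sum to 1 (assumed normalization)."""
    raw_ids = np.asarray(raw_ids, dtype=float)
    return raw_ids / raw_ids.sum()

def sampling_probabilities(labels, class_scores):
    """Give each class a total sampling mass equal to its score,
    split evenly among that class's examples."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=len(class_scores))
    per_example = class_scores[labels] / counts[labels]
    # Renormalize in case some class has no examples in `labels`.
    return per_example / per_example.sum()

# Hypothetical ID estimates for three classes (illustrative values only).
raw_ids = [12.0, 18.0, 30.0]
d_hat = normalize_scores(raw_ids)

# A toy imbalanced label vector: 6, 3, and 1 examples per class.
labels = np.array([0] * 6 + [1] * 3 + [2] * 1)
probs = sampling_probabilities(labels, d_hat)

# Draw a resampled batch of indices according to the ID-based probabilities.
rng = np.random.default_rng(0)
batch = rng.choice(len(labels), size=8, replace=True, p=probs)
```

Under this scheme, the share of draws a class receives depends on its score rather than on how many examples it has, which is the contrast with cardinality-based resampling that the abstract draws.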

Paper Structure

This paper contains 28 sections, 16 equations, 11 figures, 17 tables, 1 algorithm.

Figures (11)

  • Figure 1: Existing approaches to long-tailed visual recognition rely on (a) the cardinalities of examples across classes, which is affected by redundant examples in the dataset, or (b) measures of class hardness, which require training an ML model. (c) Our approach of using data ID provides information about the manifold (and hence, complexity) of each class without being affected by redundant examples and without training an ML model. (d) We show that this simple, model-free, plug-and-play approach provides significant improvements with different approaches (a sample result shown here with a resampling approach).
  • Figure 2: Estimated IDs for CIFAR10 classes for different sample counts. ID can capture inherent differences among classes and is relatively robust against sample count, as exemplified here with the CIFAR10 dataset, subsampled at different sample counts per class.
  • Figure 3: Analysis of FisherS-estimated IDs. The analyses with synthetic data in (a) and with a real dataset in Fig. 2 suggest that ID estimation is robust to sample count. Moreover, the analysis in (b) shows that it is not affected by extrinsic dimensionality either. Dashed lines: true ID values.
  • Figure 4: ID estimates of CIFAR-10 classes with added Gaussian noise of varying scales ($\sigma\in[0, 1]$), showing robustness of ID to even drastic amounts of noise.
  • Figure 5: Exp. 3: Histogram of estimated ID values of the two semantically different types of classes in the SVCI-20 dataset. See the robustness analysis section for all ID values.
  • ...and 6 more figures