A Survey on Small Sample Imbalance Problem: Metrics, Feature Analysis, and Solutions

Shuxian Zhao, Jie Gui, Minjing Dong, Baosheng Yu, Zhipeng Gui, Lu Dong, Yuan Yan Tang, James Tin-Yau Kwok

TL;DR

The paper tackles the small sample imbalance (S&I) problem by proposing a data-centric analytical framework that integrates imbalance metrics, data/feature complexity, and dataset characteristics to guide solution design. It surveys a broad spectrum of methods across conventional resampling, data-complexity–driven techniques, and extreme S&I approaches, supplemented by experiments showing that classifier choice often outweighs gains from resampling. A key contribution is the diagnostic emphasis on data characteristics and inter-class/intra-class difficulty, rather than relying solely on data augmentation. The findings inform principled, dataset-specific strategy selection and highlight open questions in dataset construction, domain adaptation, and evaluation for robust S&I handling in real-world tasks.

Abstract

The small sample imbalance (S&I) problem is a major challenge in machine learning and data analysis. It is characterized by a small number of samples and an imbalanced class distribution, which leads to poor model performance. In addition, indistinct inter-class feature distributions further complicate classification tasks. Existing methods often rely on algorithmic heuristics without sufficiently analyzing the underlying data characteristics. We argue that a detailed analysis from the data perspective is essential before developing an appropriate solution. Therefore, this paper proposes a systematic analytical framework for the S&I problem. First, we summarize imbalance metrics and complexity analysis methods, highlighting the need for interpretable benchmarks to characterize S&I problems. Second, we review recent solutions for conventional, complexity-based, and extreme S&I problems, revealing methodological differences in handling various data distributions. Our summary finds that resampling remains a widely adopted solution. However, our experiments on binary and multiclass datasets reveal that the performance differences between classifiers significantly exceed the improvements achieved through resampling. Finally, this paper highlights open questions and discusses future trends.
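
To make the two notions the abstract leans on more concrete, the sketch below computes the plain imbalance ratio (majority-class count divided by minority-class count) and applies naive random oversampling, one of the simplest forms of resampling. It is an illustration only: the function names `imbalance_ratio` and `random_oversample` and the toy data are assumptions, and it does not reproduce the paper's metrics (adjustedIR, IF, $R_{\text{aug}}$, $\text{IBI}^{3}$) or its experimental setup.

```python
from collections import Counter
import random


def imbalance_ratio(labels):
    """Plain imbalance ratio: largest class count / smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class until every
    class matches the largest class. Hypothetical helper for illustration;
    it adds no new information, only copies of existing samples."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


# Toy small-sample, imbalanced dataset: 18 majority vs. 2 minority samples.
labels = [0] * 18 + [1] * 2
samples = list(range(20))

ratio_before = imbalance_ratio(labels)
balanced_x, balanced_y = random_oversample(samples, labels)
ratio_after = imbalance_ratio(balanced_y)
print(ratio_before, ratio_after, len(balanced_y))
```

Note that the oversampled set is balanced by count but still contains only two distinct minority samples, which is one way to see why resampling alone may not resolve the small-sample side of the problem.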

Paper Structure

This paper contains 17 sections, 16 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: The Systematic Analysis Framework of the Small Sample Imbalance Problem.
  • Figure 2: Distribution of inter-class imbalance.
  • Figure 3: Other Example Imbalances.
  • Figure 4: Correlation between imbalance indicators. (In adjustedIR, $\lambda$ is set to 1. In IF, $\alpha$ is set to 2. In $R_{\text{aug}}$, $k$ is 5 and $\theta$ is 2. In $\text{IBI}^{3}$, $k$ is 5.)
  • Figure 5: t-SNE plots of three different feature distributions.
  • ...and 2 more figures