Improving Online Bagging for Complex Imbalanced Data Stream

Bartosz Przybyl, Jerzy Stefanowski

TL;DR

This paper addresses learning from imbalanced data streams with concept drift and local difficulty factors such as minority sub-concepts and unsafe examples. It introduces Neighbourhood Oversampling Online Bagging (NOOB), Neighbourhood Undersampling Online Bagging (NUOB), and a Hybrid variant (HNOB). All three adjust the Poisson sampling parameter $\lambda$ using local safeness metrics (e.g., $L^2_{min}$ and $L^2_{maj}$) computed from the $k$ nearest neighbours of an incoming example within a sliding window of size $W$. Empirical results on synthetic streams show that NUOB excels with borderline minority drifts, NOOB with rare minority drifts, and HNOB offers the strongest overall performance in complex multi-factor drift scenarios. The findings demonstrate the value of incorporating local neighbourhood information into online bagging for robust imbalanced-stream learning, with practical implications for adaptive classifiers in non-stationary environments.
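The resampling idea summarized above can be illustrated with a minimal Python sketch. This summary does not give the paper's actual formulas for $\lambda$, so the rule in `sampling_lambda` is a hypothetical stand-in chosen for illustration, not the authors' definition. The helper names (`safeness`, `sampling_lambda`, `train_step`) and the `member_update` callback are likewise assumptions. The sketch shows only the general pattern: estimate how safe an incoming example is from its $k$ nearest neighbours in a sliding window, turn that into a Poisson rate, and present the example to each ensemble member that many times.

```python
import math
import random
from collections import deque


def poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's method (adequate for small lam)."""
    threshold, count, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return count
        count += 1


def safeness(x, y, window, k):
    """Fraction of the k nearest window neighbours of x that share label y."""
    if not window:
        return 1.0
    nearest = sorted(window, key=lambda item: math.dist(x, item[0]))[:k]
    return sum(1 for _, label in nearest if label == y) / len(nearest)


def sampling_lambda(x, y, window, k, minority=1):
    """HYPOTHETICAL rule (not from the paper): majority examples keep
    lambda = 1; minority examples are oversampled by the window imbalance
    ratio, boosted further the less safe their neighbourhood is."""
    if y != minority:
        return 1.0
    n_min = sum(1 for _, label in window if label == minority)
    n_maj = len(window) - n_min
    ratio = n_maj / n_min if n_min else 1.0
    return max(1.0, ratio) * (2.0 - safeness(x, y, window, k))


def train_step(member_update, n_members, x, y, window, k, rng):
    """Online-bagging step: each member sees (x, y) Poisson(lambda) times."""
    lam = sampling_lambda(x, y, window, k)
    for m in range(n_members):
        for _ in range(poisson(lam, rng)):
            member_update(m, x, y)
    window.append((x, y))  # a deque(maxlen=W) drops the oldest example
    return lam
```

Here `window` is expected to be a `collections.deque(maxlen=W)` of `(features, label)` pairs, and `member_update(m, x, y)` stands for whatever incremental learner update the ensemble uses. An undersampling variant would instead lower $\lambda$ below 1 for majority examples; that is omitted here because the summary does not specify how.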

Abstract

Learning classifiers from imbalanced and concept-drifting data streams is still a challenge. Most current proposals account only for changes in the global imbalance ratio and ignore local difficulty factors, such as the decomposition of the minority class into sub-concepts and the presence of unsafe types of examples (borderline or rare ones). As these factors may deteriorate the performance of popular online classifiers when present in the stream, we propose extensions of resampling online bagging, namely Neighbourhood Undersampling or Oversampling Online Bagging, to take better account of unsafe minority examples. Computational experiments with synthetic complex imbalanced data streams show their advantage over earlier variants of online bagging resampling ensembles.

Paper Structure (6 sections, 4 equations, 4 figures, 4 tables, 1 algorithm)


Figures (4)

  • Figure 1: An illustration of different difficulty factors in imbalanced data - based on BrzezStef2019
  • Figure 2: Plots showing bagging variants reacting to two kinds of drift: 80% rare minority examples, with the G-mean measure (the left-hand figure), and the minority class split into 5 sub-concepts, with the Recall measure (the right-hand figure)
  • Figure 3: Plots showing the G-mean measure of bagging variants reacting to two kinds of drift: 80% rare minority examples, Split 5, and the imbalance ratio changing from 10% to 1% (the left-hand figure), and imbalance ratio 10% and 80% borderline minority examples (the right-hand figure)
  • Figure 4: Plots of G-mean measures for complex data streams Split5+Im1+Borderline40+Rare40 (the left-hand figure) and StaticIm10+Im1+Rare80 (the right-hand figure)