A Similarity-Based Oversampling Method for Multi-label Imbalanced Text Data

Ismail Hakki Karaman, Gulser Koksal, Levent Eriskin, Salih Salihoglu

TL;DR

The paper addresses labeled-data scarcity and severe class imbalance in multi-label text classification. It introduces a similarity-based oversampling approach that sources new labeled instances from unlabeled data using embedding-based similarity and retains only those candidates that demonstrably improve classifier performance. On the OPP-115 dataset, the method raises the $F1$-score from $0.5961$ to $0.628$, a relative gain of $5.34\%$, after adding $90$ instances. The approach offers a scalable, performance-driven alternative to traditional synthetic oversampling and self-training, with clear avenues for extending the similarity metrics, applying the method across domains, and integrating active learning.

Abstract

In real-world applications, even as data availability increases, obtaining labeled data for machine learning (ML) projects remains challenging due to the high cost and intensive effort required for data annotation. Many ML projects, particularly those focused on multi-label classification, also grapple with data imbalance, where certain classes lack sufficient data to train effective classifiers. This study introduces and examines a novel oversampling method for multi-label text classification, designed to address the performance problems associated with data imbalance. The proposed method identifies potential new samples from unlabeled data by leveraging similarity measures between instances. By iteratively searching the unlabeled dataset, the method locates instances similar to those in underrepresented classes and evaluates their contribution to classifier performance. Instances that improve performance are then added to the labeled dataset. Experimental results indicate that the proposed approach improves classifier performance after oversampling.
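
The loop described in the abstract (find unlabeled instances similar to an underrepresented class, test whether each one helps, keep only those that do) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity on precomputed embeddings, the `evaluate` callback, and especially the rule that a candidate inherits the label vector of its nearest minority seed are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


def similarity_oversample(X_lab, Y_lab, X_unlab, minority_label,
                          evaluate, max_candidates=10):
    """Greedy sketch: try the unlabeled instances closest to the minority
    class and keep each one only if the evaluation score strictly improves.

    evaluate(X, Y) is a hypothetical callback returning a score such as a
    validation F1 obtained after retraining a classifier on (X, Y).
    """
    # Labeled instances that carry the underrepresented label act as seeds.
    seeds = np.flatnonzero(Y_lab[:, minority_label] == 1)
    sims = cosine_similarity_matrix(X_unlab, X_lab[seeds])
    best_seed = sims.argmax(axis=1)   # nearest seed for each unlabeled row
    best_sim = sims.max(axis=1)       # similarity to that seed
    order = np.argsort(-best_sim)[:max_candidates]

    X_cur, Y_cur = X_lab.copy(), Y_lab.copy()
    best_score = evaluate(X_cur, Y_cur)
    accepted = []
    for idx in order:
        # ASSUMPTION: the candidate inherits the label vector of its
        # nearest minority seed; the paper may assign labels differently.
        y_new = Y_lab[seeds[best_seed[idx]]]
        X_try = np.vstack([X_cur, X_unlab[idx]])
        Y_try = np.vstack([Y_cur, y_new])
        score = evaluate(X_try, Y_try)
        if score > best_score:        # retain only performance-improving instances
            X_cur, Y_cur, best_score = X_try, Y_try, score
            accepted.append(int(idx))
    return X_cur, Y_cur, accepted, best_score
```

In practice `evaluate` would retrain the multi-label classifier and score it on held-out data, so each candidate costs one training run; the `max_candidates` cap in this sketch is one simple way to bound that cost.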

Paper Structure

This paper contains 11 sections, 7 equations, 7 figures, 3 tables, 2 algorithms.

Figures (7)

  • Figure 1: The number of publications in imbalanced learning (Reprinted from chen2024survey).
  • Figure 2: Embedding representation for some words.
  • Figure 3: A representation of the labeled and unlabeled instances.
  • Figure 4: Flow chart for the overall algorithm.
  • Figure 5: Flow chart for oversampling algorithm.
  • ...and 2 more figures