Addressing Multilabel Imbalance with an Efficiency-Focused Approach Using Diffusion Model-Generated Synthetic Samples

Francisco Charte, Miguel Ángel Dávila, María Dolores Pérez-Godoy, María José del Jesus

TL;DR

This work tackles the challenge of class-imbalanced multilabel learning by introducing MLDM, a diffusion-model-based oversampling method tailored to multilabel data. MLDM trains a diffusion model on the subset of minority-label samples and generates complete synthetic instances, including labelsets, to reduce imbalance without extensive nearest-neighbor searches. Across eight multilabel datasets and multiple classifiers, MLDM demonstrates competitive performance and notably improved efficiency compared with established MOAs, supported by statistical analysis. The approach offers a practical, model-agnostic preprocessing step with open-source implementation for broader adoption in multilabel learning tasks.
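The summary above does not say how the minority-label subset is chosen. As a minimal sketch, assuming the commonly used multilabel imbalance measures (IRLbl per label and MeanIR as their average) and assuming a label counts as "minority" when its IRLbl exceeds MeanIR, the selection step could look like the following. The function names `irlbl` and `minority_mask` and the toy label matrix are hypothetical and are not taken from the MLDM implementation.

```python
import numpy as np

def irlbl(Y):
    """Imbalance ratio per label: count of the most frequent label
    divided by the count of each label (Y is a 0/1 label matrix)."""
    counts = Y.sum(axis=0).astype(float)
    counts[counts == 0] = np.nan  # labels that never appear stay undefined
    return np.nanmax(counts) / counts

def minority_mask(Y):
    """Boolean mask of rows holding at least one minority label.
    Assumed criterion: a label is minority when IRLbl > MeanIR."""
    ir = irlbl(Y)
    mean_ir = np.nanmean(ir)
    minority_labels = ir > mean_ir  # NaN compares as False
    return Y[:, minority_labels].any(axis=1), mean_ir

# Toy label matrix: 6 instances, 3 labels (hypothetical data).
Y = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
])
mask, mean_ir = minority_mask(Y)
```

Rows selected by `mask` would be the ones passed to the generative model for training; the remaining rows are left untouched.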

Abstract

Predictive models trained on imbalanced data tend to produce biased results. This problem is exacerbated when there is not just one output label, but a set of them. This is the case for multilabel learning (MLL) algorithms used to classify patterns, rank labels, or learn the distribution of outputs. Many solutions have been proposed in the literature. The one that can be applied universally, independent of the algorithm used to build the model, is data resampling. Generating new instances associated with minority labels, so that empty areas of the feature space are filled, helps to improve the resulting models. The quality of these new instances depends on the algorithm used to generate them. In this paper, a diffusion model tailored to produce new instances for MLL data, called MLDM (MultiLabel Diffusion Model), is proposed. Diffusion models have mainly been used to generate artificial images and videos, and the proposed MLDM is based on this type of model. The experiments conducted compare MLDM with several other MLL resampling algorithms. The results show that MLDM is competitive while improving efficiency.

Paper Structure

This paper contains 27 sections, 16 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: How a DDPM works: In training, noise is gradually added to the input while learning the reverse process. Inference produces a data sample from noise using the learned steps.
  • Figure 2: The process begins with any MLD, from which minority samples are extracted and utilized to train the model. Subsequently, the model is capable of generating synthetic samples, beginning with Gaussian noise.
  • Figure 3: Frequency of the three most common labels (blue shades) and the three rarest labels (red shades) in each MLD. In most cases, the total number of instances with minority labels is almost negligible compared to the majority ones.
  • Figure 4: Heatmap of changes in MeanIR for each dataset and resampling method. Positive values denote an improvement, while negative ones indicate a deterioration (i.e. the level of imbalance has increased).
  • Figure 5: Radar plots showing classification performance by metric (row), resampling method (column) and classifier (vertex). The larger the area, the better the performance.
  • ...and 3 more figures
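
The noising half of the DDPM process described in the Figure 1 caption can be illustrated with the standard closed-form forward step, x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε. The sketch below is only an illustration under stated assumptions: the linear beta schedule, T = 1000, the function names, and the idea of concatenating features and label bits into one row are all hypothetical choices, not settings confirmed for MLDM.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule
    (assumed schedule; the paper's actual schedule is not given here)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Closed-form forward step: noise x0 directly to timestep t.
    Returns the noisy sample and the Gaussian noise that was added."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical instance: 3 scaled features followed by 2 label bits.
x0 = np.array([[0.2, -1.0, 0.5, 1.0, 0.0]])

x_early, eps_early = q_sample(x0, 0, alpha_bar, rng)    # almost unchanged
x_late, eps_late = q_sample(x0, 999, alpha_bar, rng)    # close to pure noise
```

During training, a network would be fitted to predict the added noise from the noisy sample and the timestep; generation then runs the learned reverse steps starting from Gaussian noise, as the captions of Figures 1 and 2 describe.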