Latent-based Diffusion Model for Long-tailed Recognition

Pengxiao Han, Changkun Ye, Jieming Zhou, Jing Zhang, Jie Hong, Xuesong Li

TL;DR

This paper tackles long-tailed recognition by introducing LDMLR, a three-stage approach that augments minority-class representations with diffusion-generated latent features. By operating in the latent feature space, LDMLR uses a class-conditional DDIM/LDM to produce pseudo-features and then jointly trains a classifier on real and generated embeddings. Empirical results on CIFAR-LT and ImageNet-LT show consistent improvements over strong baselines, with latent augmentation outperforming image-space diffusion and focused tail-class augmentation providing the largest gains. The method is efficient due to latent-space diffusion and demonstrates the potential of diffusion models for enhancing imbalanced visual recognition in practical settings.

Abstract

Long-tailed, imbalanced class distributions are a common issue in practical computer vision applications. Previous work addresses this problem with methods that fall into several categories: re-sampling, re-weighting, transfer learning, and feature augmentation. In recent years, diffusion models have shown impressive generation ability in many sub-problems of deep computer vision. However, their generative power has not been explored for long-tailed problems. We propose a new approach, the Latent-based Diffusion Model for Long-tailed Recognition (LDMLR), as a feature augmentation method to tackle the issue. First, we encode the imbalanced dataset into features using the baseline model. Then, we train a Denoising Diffusion Implicit Model (DDIM) on these encoded features to generate pseudo-features. Finally, we train the classifier on the encoded features and the pseudo-features from the previous two steps. The proposed method improves the model's accuracy on the CIFAR-LT and ImageNet-LT datasets.
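
The three stages described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the encoder is replaced by synthetic 4-dimensional features, and the trained class-conditional denoiser is replaced by a closed-form stand-in (`toy_noise_predictor`) that would be optimal if a class's features were exactly Gaussian. The deterministic DDIM update is the standard one. All function names, the linear beta schedule, and the sample counts are assumptions chosen for illustration.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def toy_noise_predictor(x_t, alpha_bar_t, class_mean, class_std):
    """Hypothetical stand-in for a trained class-conditional denoiser.

    If the class features were exactly N(class_mean, class_std^2 I), this is
    the closed-form E[eps | x_t]. A real system would call a network here.
    """
    var_t = alpha_bar_t * class_std**2 + (1.0 - alpha_bar_t)
    return np.sqrt(1.0 - alpha_bar_t) * (x_t - np.sqrt(alpha_bar_t) * class_mean) / var_t

def ddim_sample(class_mean, class_std, n_samples, alpha_bars, n_sampling_steps=50, seed=0):
    """Deterministic DDIM sampling (eta = 0) in feature space."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, class_mean.shape[0]))
    # Evenly spaced subsequence of timesteps, visited from noisy to clean.
    timesteps = np.linspace(len(alpha_bars) - 1, 0, n_sampling_steps).astype(int)
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = toy_noise_predictor(x, a_t, class_mean, class_std)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)      # predicted clean feature
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps  # step to the previous timestep
    return x

# Stage 1 (stand-in): pretend an encoder produced 20 features for one tail class.
rng = np.random.default_rng(0)
tail_features = np.array([2.0, -1.0, 0.5, 3.0]) + 0.5 * rng.standard_normal((20, 4))

# Stage 2 (stand-in): "fit" the class statistics, then draw pseudo-features with DDIM.
mu = tail_features.mean(axis=0)
sigma = tail_features.std(axis=0).mean()
pseudo_features = ddim_sample(mu, sigma, n_samples=500, alpha_bars=make_alpha_bars())

# Stage 3: the classifier would be trained on real and generated features together.
augmented = np.concatenate([tail_features, pseudo_features], axis=0)
```

In this toy setting the pseudo-features cluster around the tail-class mean, which is the behavior the feature-augmentation stage relies on. A learned denoiser would capture a richer, non-Gaussian feature distribution.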

Paper Structure (16 sections, 12 equations, 3 figures, 5 tables, 1 algorithm)

This paper contains 16 sections, 12 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the proposed framework, LDMLR. The figure describes the training of the framework: (a) obtain encoded features by pre-training a convolutional neural network on the long-tailed training set, (b) generate pseudo-features with the diffusion model, using the encoded features, and (c) train the fully connected layers on the encoded and pseudo-features. The encoder from (a) and the classifier from (c) are used for prediction in the evaluation stage.
  • Figure 2: The impact of generation ratio on classification accuracy. The evaluation is conducted on CIFAR-10-LT and CIFAR-100-LT with $\mathrm{IF}=10$.
  • Figure 3: Encoded and generated features of the tail class (class 9) in CIFAR-10-LT during model training. The generated features (blue points) overlap the encoded features (red points) from the original training dataset while slightly enriching the feature space.
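
For readers unfamiliar with the $\mathrm{IF}$ notation in the Figure 2 caption: the imbalance factor is the ratio between the sizes of the largest and smallest classes. The sketch below computes per-class training counts under the exponential profile commonly used to build CIFAR-LT benchmarks. The source does not spell out the construction, so the profile and the helper name `long_tailed_counts` are assumptions. The 5,000 images per class is the size of a standard CIFAR-10 training class.

```python
def long_tailed_counts(n_max, num_classes, imbalance_factor):
    """Per-class sample counts decaying exponentially from n_max to n_max / IF.

    Assumed profile: n_i = n_max * IF^(-i / (C - 1)) for class index i = 0..C-1.
    """
    return [
        int(round(n_max * imbalance_factor ** (-i / (num_classes - 1))))
        for i in range(num_classes)
    ]

# CIFAR-10 has 5,000 training images per class; IF = 10 as in Figure 2.
counts = long_tailed_counts(n_max=5000, num_classes=10, imbalance_factor=10)
head_count, tail_count = counts[0], counts[-1]
```

Under this profile the head class (class 0) keeps all of its images, while the tail class (class 9, the one visualized in Figure 3) keeps one tenth of them.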