Enhanced Long-Tailed Recognition with Contrastive CutMix Augmentation

Haolin Pan, Yong Guo, Mianjie Yu, Jian Chen

TL;DR

Contrastive CutMix (ConCutMix) constructs augmented samples with semantically consistent labels to improve long-tailed recognition: it computes the similarities between samples in a semantic space learned by contrastive learning and uses them to rectify the area-based labels produced by CutMix.

Abstract

Real-world data often follows a long-tailed distribution, where a few head classes account for most of the data and a large number of tail classes contain only very limited samples. In practice, deep models often generalize poorly on tail classes because of this imbalance. Data augmentation has become an effective remedy, synthesizing new samples for tail classes. One popular approach is CutMix, which explicitly mixes images of tail classes with other images and constructs the labels according to the ratio of the areas cropped from the two images. However, these area-based labels entirely ignore the inherent semantic information of the augmented samples, often leading to misleading training signals. To address this issue, we propose Contrastive CutMix (ConCutMix), which constructs augmented samples with semantically consistent labels to boost the performance of long-tailed recognition. Specifically, we compute the similarities between samples in the semantic space learned by contrastive learning, and use them to rectify the area-based labels. Experiments show that ConCutMix significantly improves the accuracy on tail classes as well as the overall performance. For example, based on ResNeXt-50, we improve the overall accuracy on ImageNet-LT by 3.0% thanks to the significant improvement of 3.3% on tail classes. We highlight that the improvement also generalizes well to other benchmarks and models. Our code and pretrained models are available at https://github.com/PanHaulin/ConCutMix.
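To make the area-based labeling concrete, the following is a minimal sketch of how plain CutMix could build a label from the pasted area alone. It is not the authors' code: the function names `cutmix` and `area_based_label`, the nested-list image format, and the convention that `lam` is the foreground fraction are all assumptions made for illustration.

```python
def cutmix(background, foreground, box):
    """Paste the `box` region of `foreground` onto a copy of `background`.

    Images are H x W nested lists of equal size (an illustrative format).
    `box` is (top, left, box_height, box_width); it is clipped to the image.
    Returns the mixed image and `lam`, the fraction of pixels that come
    from the foreground image.
    """
    height, width = len(background), len(background[0])
    top, left, box_h, box_w = box
    bottom = min(top + box_h, height)
    right = min(left + box_w, width)
    mixed = [row[:] for row in background]  # copy so the input is untouched
    for r in range(top, bottom):
        for c in range(left, right):
            mixed[r][c] = foreground[r][c]
    pasted = max(bottom - top, 0) * max(right - left, 0)
    return mixed, pasted / (height * width)


def area_based_label(lam, fg_class, bg_class, num_classes):
    """Soft label that depends only on the area ratio, not on image content."""
    label = [0.0] * num_classes
    label[bg_class] += 1.0 - lam
    label[fg_class] += lam
    return label


# Hypothetical 4x4 "images": background is class 0, foreground is class 1.
bg = [[0] * 4 for _ in range(4)]
fg = [[1] * 4 for _ in range(4)]
mixed, lam = cutmix(bg, fg, box=(0, 0, 2, 2))
label = area_based_label(lam, fg_class=1, bg_class=0, num_classes=3)
```

Because the label is a function of `lam` only, two crops with the same area receive the same label even if one contains the object and the other contains only background texture, which is the weakness the abstract points to.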

Paper Structure (12 sections, 9 equations, 13 figures, 12 tables, 1 algorithm)

Figures (13)

  • Figure 1: Difference between the area-based labels from CutMix and the scores of ConCutMix. These images are synthesized with the same area ratio but have different semantics. The image with a red box shows that the area-based label may be entirely wrong in terms of semantics. In contrast, the scores for ConCutMix are intuitively more consistent with semantics.
  • Figure 2: Visualization of the feature distribution on ImageNet, with each color representing a specific class. We compare ConCutMix with two popular augmentation methods, i.e., CutMix (Yun et al., 2019) and CMO (Park et al., 2022), all of which are implemented on BCL (Zhu et al., 2022). Clearly, our ConCutMix can effectively separate classes that differ significantly in their number of samples. In addition, ConCutMix places the considered synthetic sample at a semantically appropriate position, whereas the other methods place it near classes that are clearly semantically inconsistent.
  • Figure 3: Difference between the area-based labels and the scores of ConCutMix when handling novel classes that are not used for CutMix. We show that two synthetic samples may semantically belong to a novel class other than the classes mixed by CutMix, while ConCutMix is able to capture semantically consistent information for each synthetic sample.
  • Figure 4: Overall pipeline for long-tailed recognition with ConCutMix. We first perform CutMix on images drawn from a balanced sampler and a random sampler to synthesize samples. The balanced sampler expands tail classes by increasing the probability that foreground images are drawn from them. Following BCL (Zhu et al., 2022), we leverage contrastive learning to establish a semantic space, where ConCutMix can construct semantically consistent labels based on the similarities with learned class centers. Finally, ConCutMix incorporates the semantically consistent labels into training to rectify the area-based labels generated by CutMix.
  • Figure 5: Details of rectifying area-based labels with semantically consistent labels. We first calculate the similarities between a synthetic sample and all learned class centers. ConCutMix then constructs semantically consistent labels based on the similarities with the TopK most similar classes. To reduce the misleading effect of noisy semantic information caused by the scarcity of samples in tail classes, ConCutMix combines the normalized semantically consistent label with the area-based label under the control of a confidence function $\gamma(\cdot)$.
  • ...and 8 more figures
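
The rectification steps described in the Figure 5 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: cosine similarity, a softmax over the TopK similarities with a `temperature` parameter, and a constant `gamma` standing in for the confidence function $\gamma(\cdot)$ are all choices made here for illustration, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_label(feature, centers, k, temperature=0.1):
    """Normalized label over the k class centers most similar to `feature`.

    Assumption: normalization is a softmax over the top-k similarities;
    all other classes receive zero weight.
    """
    sims = [cosine(feature, center) for center in centers]
    topk = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    peak = max(sims[i] for i in topk)  # subtract the max for numerical stability
    weights = {i: math.exp((sims[i] - peak) / temperature) for i in topk}
    total = sum(weights.values())
    label = [0.0] * len(centers)
    for i, weight in weights.items():
        label[i] = weight / total
    return label


def rectify(area_label, sem_label, gamma):
    """Blend the two labels; `gamma` in [0, 1] is a stand-in for gamma(.)."""
    return [(1.0 - gamma) * a + gamma * s for a, s in zip(area_label, sem_label)]


# Hypothetical 2-D class centers and a synthetic-sample feature near class 0.
centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
feature = [1.0, 0.1]
sem = semantic_label(feature, centers, k=2)
# The area-based label claims the sample is half class 0 and half class 2.
rectified = rectify([0.5, 0.0, 0.5], sem, gamma=0.5)
```

In this toy setup the feature lies close to the class 0 center, so the blended label shifts weight away from class 2 even though the area ratio assigned it half the mass. Setting `gamma` to 0 recovers the plain area-based label.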