DUEL: Duplicate Elimination on Active Memory for Self-Supervised Class-Imbalanced Learning

Won-Seok Choi, Hyundo Lee, Dong-Sig Han, Junseok Park, Heeyeon Koo, Byoung-Tak Zhang

TL;DR

The paper tackles the poor generalization of self-supervised learning (SSL) under long-tailed class distributions by introducing DUEL, an active memory framework. It combines memory-inspired Hebbian Metric Learning (HML) with a distinctiveness objective to selectively replace duplicated items, enriching memory diversity without relying on per-sample labels. Theoretical results connect memory-augmented objectives to the canonical HML loss and yield a practical, GPU-friendly DUEL policy that improves downstream robustness on CIFAR-10, STL-10, and ImageNet-LT while preserving intra-class structure. Empirically, DUEL increases the entropy of the class distribution in memory and promotes better inter-class separation, which matters for SSL in real-world imbalanced settings.

Abstract

Recent machine learning algorithms have been developed using well-curated datasets, which often require substantial cost and resources. On the other hand, the direct use of raw data often leads to overfitting towards frequently occurring class information. To address class imbalances cost-efficiently, we propose an active data filtering process during self-supervised pre-training in our novel framework, Duplicate Elimination (DUEL). This framework integrates an active memory inspired by human working memory and introduces distinctiveness information, which measures the diversity of the data in the memory, to optimize both the feature extractor and the memory. The DUEL policy, which replaces the most duplicated data with new samples, aims to enhance the distinctiveness information in the memory and thereby mitigate class imbalances. We validate the effectiveness of the DUEL framework in class-imbalanced environments, demonstrating its robustness and providing reliable results in downstream tasks. We also analyze the role of the DUEL policy in the training process through various metrics and visualizations.
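
The replacement rule described above (evict the most duplicated memory item, admit the new sample) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `duplication_scores` and `duel_replace`, the temperature value, and the cosine-similarity kernel used as a duplication score are all assumptions made for illustration, standing in for the paper's mutual duplication probability.

```python
import numpy as np

def duplication_scores(memory, temperature=0.1):
    """Score each memory row by how strongly other rows duplicate it.

    Assumption: duplication is approximated by a cosine-similarity kernel,
    not by the paper's mutual duplication probability.
    """
    # L2-normalize rows so that dot products are cosine similarities.
    mem = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    cos = mem @ mem.T
    # Kernel lies in (0, 1]; it equals 1 for rows pointing the same way.
    kernel = np.exp((cos - 1.0) / temperature)
    # An item is not counted as a duplicate of itself.
    np.fill_diagonal(kernel, 0.0)
    return kernel.sum(axis=1)

def duel_replace(memory, new_item, temperature=0.1):
    """Return a copy of memory with the most duplicated row replaced."""
    victim = int(np.argmax(duplication_scores(memory, temperature)))
    updated = memory.copy()
    updated[victim] = new_item
    return updated, victim

# Three near-duplicate directions plus two distinct ones (hypothetical data).
memory = np.array([[1.0, 0.0], [0.99, 0.1], [0.98, -0.1],
                   [0.0, 1.0], [-1.0, 0.0]])
updated, victim = duel_replace(memory, np.array([0.0, -1.0]))
print(victim, updated[victim])
```

In this toy memory the evicted row comes from the crowded cluster, so the memory becomes more spread out after the update. A first-in-first-out queue, by contrast, would evict the oldest row regardless of how redundant it is.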

Paper Structure

This paper contains 40 sections, 9 theorems, 55 equations, 10 figures, 5 tables, 4 algorithms.

Key Result

Proposition 1

Minimizing $D_{\text{KL}}(p(x,c)\,\|\,q(x,c;f))$ is equivalent to minimizing $\mathcal{L}_{\text{HML}}(f;\mathcal{D})$, which is expressed in terms of $\mathcal{I}_h(f;\mathcal{D})$ and $\mathcal{I}_d(f;\mathcal{D})$, the Hebbian information and the distinctiveness information, respectively.
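
As a hedged aside, an equivalence of this kind typically starts from the standard decomposition of the KL divergence. The step below uses only that identity and does not reproduce the paper's own expression for $\mathcal{L}_{\text{HML}}$. It assumes that the data distribution $p(x,c)$ does not depend on $f$, and $H(p)$ denotes its entropy:

$$
D_{\text{KL}}\big(p(x,c)\,\|\,q(x,c;f)\big)
= \mathbb{E}_{p(x,c)}\big[\log p(x,c)\big] - \mathbb{E}_{p(x,c)}\big[\log q(x,c;f)\big]
= -H(p) - \mathbb{E}_{p(x,c)}\big[\log q(x,c;f)\big]
$$

Under that assumption $H(p)$ is a constant with respect to $f$, so $\arg\min_f D_{\text{KL}}(p\,\|\,q) = \arg\min_f \big(-\mathbb{E}_{p(x,c)}[\log q(x,c;f)]\big)$. Minimizing the divergence therefore reduces to minimizing a cross-entropy term that depends on $f$ only through $q$.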

Figures (10)

  • Figure 1: Visualizations of the concepts of working memory and our proposed DUEL framework. (A) A real-world agent perceives data from the environment and maps the representation to solve the task. Working memory finds semantically duplicated signals and reduces them to maximize the total amount of information. (B) Inspired by this cognitive process, we design the Duplicate Elimination (DUEL) framework. With mutual duplication probability, the representations form a graph structure (center) and are filtered out (right) to gradually maximize the distinctiveness information.
  • Figure 2: Conceptual Visualization of Hebbian Metric Learning. HML minimizes the Hebbian information while maximizing the distinctiveness information.
  • Figure 3: Visualization of the general DUEL framework. Our method stores diverse data as negative samples by Duplicate Elimination. The DUEL policy selects the most duplicated sample in memory (green) and replaces it with current data (purple).
  • Figure 4: Visualization of the performance enhancement in the linear probing task. In both D-MoCo and D-SimCLR, accuracy improves gradually over the training steps. In D-MoCo especially, the DUEL process can prevent the dramatic performance degradation that occurs with high $\rho_{\max}$.
  • Figure 5: t-SNE visualization of the active data filtering process with DUEL policy. (a) The representations extracted by the trained model along with their corresponding class. (b) The agent faces a dominant class (pink) that occurs more frequently than others. (c) The DUEL policy $\pi_{\text{DUEL}}$ replaces duplicated data with newer data and maximizes the distinctiveness information.
  • ...and 5 more figures
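
The dominant-class scenario in Figure 5 can be quantified by the entropy of the class labels currently held in memory, the diagnostic mentioned in the TL;DR. The sketch below is illustrative only: `class_entropy` is a hypothetical helper, and the labels serve purely as an after-the-fact analysis tool, since the method itself is described as not relying on per-sample labels.

```python
import numpy as np

def class_entropy(labels, num_classes):
    """Shannon entropy (in nats) of the class distribution of memory labels."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    probs = counts / counts.sum()
    # Skip empty classes, following the convention 0 * log(0) = 0.
    nonzero = probs[probs > 0]
    return float(-(nonzero * np.log(nonzero)).sum())

# Hypothetical memory contents: one dominated by class 0, one balanced.
skewed = np.array([0, 0, 0, 0, 0, 0, 1, 2])
balanced = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(class_entropy(skewed, 4), class_entropy(balanced, 4))
```

The entropy is largest, $\log K$ for $K$ classes, when the memory holds every class equally often, and it drops to zero when a single class fills the memory. A higher value therefore indicates a more class-balanced memory.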

Theorems & Definitions (19)

  • Definition 1: Mutual duplication probability
  • Proposition 1: Hebbian Metric Learning
  • Proposition 2: HML Bound
  • Theorem 1: Optimality of M-HML
  • Definition 2: Duplicate Elimination
  • Definition 3: Message passing
  • Lemma 1: Joint distribution with density function
  • proof
  • Proposition 1: Hebbian Metric Learning
  • proof
  • ...and 9 more