
Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

Jihai Zhang, Xiang Lan, Xiaoye Qu, Yu Cheng, Mengling Feng, Bryan Hooi

TL;DR

This paper tackles feature suppression in self-supervised contrastive learning, where models capture only part of the input information and overlook the rest, which hinders semantic understanding. It introduces Multistage Contrastive Learning (MCL), a model-agnostic framework that uses feature-aware negative sampling at each stage and cross-stage representation integration to progressively uncover previously unlearned features while preserving learned ones. Across unimodal setups (e.g., ResNet-18, STL-10) and multimodal CLIP setups (ResNet and ViT backbones on CC12M/MMVP), MCL yields consistent improvements, including gains on MMVP (ResNet: 19.3→24.4; ViT: 20.0→32.6) and notable attribute-level improvements, demonstrating its ability to diversify feature discovery while retaining prior information. The results suggest that MCL is a scalable, architecture-agnostic approach to improving representation learning in both unimodal and multimodal contexts, with cross-stage fusion and clustering dynamics as promising avenues for further optimization.

Abstract

Self-Supervised Contrastive Learning has proven effective in deriving high-quality representations from unlabeled data. However, a major challenge that hinders both unimodal and multimodal contrastive learning is feature suppression, a phenomenon where the trained model captures only a limited portion of the information from the input data while overlooking other potentially valuable content. This issue often leads to indistinguishable representations for visually similar but semantically different inputs, adversely affecting downstream task performance, particularly those requiring rigorous semantic comprehension. To address this challenge, we propose a novel model-agnostic Multistage Contrastive Learning (MCL) framework. Unlike standard contrastive learning which inherently captures one single biased feature distribution, MCL progressively learns previously unlearned features through feature-aware negative sampling at each stage, where the negative samples of an anchor are exclusively selected from the cluster it was assigned to in preceding stages. Meanwhile, MCL preserves the previously well-learned features by cross-stage representation integration, integrating features across all stages to form final representations. Our comprehensive evaluation demonstrates MCL's effectiveness and superiority across both unimodal and multimodal contrastive learning, spanning a range of model architectures from ResNet to Vision Transformers (ViT). Remarkably, in tasks where the original CLIP model has shown limitations, MCL dramatically enhances performance, with improvements up to threefold on specific attributes in the recently proposed MMVP benchmark.
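
The feature-aware negative sampling described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `masked_info_nce`, the InfoNCE-style loss form, and the `temperature` default are assumptions made for illustration. The only idea taken from the text is that an anchor's negatives are restricted to samples sharing its pseudo label, i.e. its cluster assignments from preceding stages.

```python
import math


def masked_info_nce(anchors, positives, pseudo_labels, temperature=0.5):
    """InfoNCE-style loss whose negatives are restricted by pseudo label.

    anchors[i] and positives[i] are two views of sample i (lists of floats).
    pseudo_labels[i] is any hashable summary of sample i's cluster
    assignments in earlier stages (hypothetical input format).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(v):
        norm = math.sqrt(dot(v, v)) or 1.0
        return [a / norm for a in v]

    anchors = [normalize(v) for v in anchors]
    positives = [normalize(v) for v in positives]

    losses = []
    for i, z in enumerate(anchors):
        pos_logit = dot(z, positives[i]) / temperature
        # Negatives: other samples that share the anchor's pseudo label.
        neg_logits = [
            dot(z, positives[j]) / temperature
            for j in range(len(positives))
            if j != i and pseudo_labels[j] == pseudo_labels[i]
        ]
        # Numerically stable log-sum-exp over the positive and its negatives.
        logits = [pos_logit] + neg_logits
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - pos_logit)
    return sum(losses) / len(losses)


views = [[1.0, 0.0], [0.0, 1.0]]
same_cluster = masked_info_nce(views, views, pseudo_labels=[0, 0])
different_clusters = masked_info_nce(views, views, pseudo_labels=[0, 1])
```

In this toy case the two samples only act as negatives for each other when they share a pseudo label, so `different_clusters` has no negatives to contrast against. Because negatives already agree with the anchor on the features that drove earlier clustering, those features cannot separate them, which is the mechanism the abstract credits for pushing each stage toward previously unlearned features.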

Paper Structure

This paper contains 26 sections, 5 equations, 7 figures, 11 tables, 1 algorithm.

Figures (7)

  • Figure 1: Demonstration of feature suppression in both unimodal (SimCLR) and multimodal (CLIP) settings. The green arrows refer to correct linear evaluation classification/pairing; the red arrows refer to incorrect ones.
  • Figure 2: Overview of the Multistage Contrastive Learning (MCL) Framework: Initially, the model is trained and its output representations are clustered. At each subsequent stage, cluster assignments from the previous stages are concatenated to derive a pseudo label. Throughout the training phase, negative samples are selected based on their matching pseudo label with the anchor. The final representations are the concatenation of representations in each stage. For simplicity, only three stages are shown here. The 'color', 'shape', and 'texture' here are used metaphorically to represent abstract features.
  • Figure 3: The top 3 samples most similar to the anchor at each stage, demonstrating the model’s shifting focus from color to texture to shape across the stages of MCL.
  • Figure 4: Linear evaluation accuracy on STL-10 dataset by incorporating MCL with SimCLR and MoCo-v2.
  • Figure 5: Linear evaluation results of MCL on CIFAR-MNIST for different training stages.
  • ...and 2 more figures
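
The bookkeeping described in the Figure 2 caption (concatenating cluster assignments from previous stages into a pseudo label, and concatenating per-stage representations into the final one) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names and the nested-list input layout are hypothetical, and the cluster ids are assumed to be already computed by whatever clustering step is used.

```python
def build_pseudo_labels(stage_assignments):
    """Concatenate per-stage cluster ids into one pseudo label per sample.

    stage_assignments[s][i] is the cluster id of sample i at stage s
    (hypothetical layout). Returns one tuple per sample.
    """
    return list(zip(*stage_assignments))


def integrate_representations(stage_representations):
    """Concatenate each sample's feature vectors from all stages.

    stage_representations[s][i] is the feature vector (list of floats)
    of sample i produced by the stage-s encoder (hypothetical layout).
    """
    return [
        [value for vector in per_sample for value in vector]
        for per_sample in zip(*stage_representations)
    ]


# Four samples, two completed stages of clustering.
pseudo_labels = build_pseudo_labels([[0, 0, 1, 1], [0, 1, 0, 1]])

# Two samples, stage 1 yields 2-d features and stage 2 yields 1-d features.
final_representations = integrate_representations(
    [[[1.0, 2.0], [3.0, 4.0]], [[5.0], [6.0]]]
)
```

Two samples match as negatives only when their pseudo-label tuples are equal, so the candidate pool for each anchor narrows as stages accumulate.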