Contrastive Learning with Consistent Representations

Zihu Wang, Yu Wang, Zhuotong Chen, Hanbin Hu, Peng Li

TL;DR

Contrastive representation learning often hinges on data augmentations (DAs), yet relying on fixed or manual augmentations can hurt generalization. The paper introduces Contrastive Learning with Consistent Representations (CoCor), which enforces DA consistency by linking augmentation strength to an optimal latent similarity $l_d^{*}(\bm v(A))$ via a Monotonic Mapping Neural Network (MMNN) $g_{\theta_d}(\cdot)$ and a dedicated consistency loss. A bi-level optimization trains the encoder alongside the MMNN while leveraging diverse composite augmentations $\bm{\Omega}_c$, yielding improved linear evaluation and transfer to object detection, with evidence of more semantically coherent and less collapsed representations. CoCor demonstrates superior performance over augmentation-aware and stronger-augmentation baselines and provides a principled framework for systematic integration of diverse data augmentations in contrastive learning.

Abstract

Contrastive learning demonstrates great promise for representation learning. Data augmentations play a critical role in contrastive learning by providing informative views of the data without requiring explicit labels. Nonetheless, the efficacy of current methods hinges heavily on the quality of the data augmentation (DA) functions employed, which are often chosen manually from a limited set of options. While exploiting diverse data augmentations is appealing, the complexities inherent in both DAs and representation learning can lead to performance deterioration. To address this challenge and facilitate the systematic incorporation of diverse data augmentations, this paper proposes Contrastive Learning with Consistent Representations (CoCor). At the heart of CoCor is a novel consistency metric termed DA consistency. This metric governs the mapping of augmented input data to the representation space, ensuring that these instances are positioned optimally in a manner consistent with the applied intensity of the DA. Moreover, we propose to learn the optimal mapping locations as a function of DA, while preserving a desired monotonic property relative to DA intensity. Experimental results demonstrate that CoCor notably enhances the generalizability and transferability of learned representations in comparison to baseline methods.

Paper Structure

This paper contains 38 sections, 19 equations, 7 figures, 13 tables, and 1 algorithm.

Figures (7)

  • Figure 1: (a) Left: An encoder trained with the standard contrastive loss can exhibit inconsistency, as different views of an instance are encouraged to be represented similarly in the feature space, irrespective of the actual difference between them. Right: A consistent encoder positions the vector of more strongly augmented data further away from that of the raw data. Here the rings represent points with varying similarities to the central representation vector. (b) Nearest-neighbor retrieval in the feature space on CUB-200 [wah2011caltech] and Flowers102 [nilsback2008automated] using pre-trained encoders. Existing contrastive methods [he2020momentum, chen2021exploring], which enforce invariance to all data augmentations, may inadvertently cluster dissimilar data closely in the feature space. By applying consistency, CoCor instead ensures that only data sharing similar latent semantics are distributed closely in the latent space.
  • Figure 2: Overview of the Proposed Method. (a) The proposed property, DA consistency, ensures that data augmented with stronger augmentation is positioned farther from the original data compared to a weakly augmented view in the representation space. (b) The Monotonic Mapping Neural Network (MMNN) predicts the optimal latent similarity between an augmented view and the original data, using augmentation composition vectors as input. Stronger augmentation results in a smaller predicted latent similarity by the MMNN.
  • Figure 3: A composite augmentation is considered stronger than another if and only if it includes all components of the latter, along with additional basic augmentations.
  • Figure 4: (a) Latent similarities under different composite augmentation lengths for encoders pre-trained with and without the consistency loss, and (b) singular values of the learned latent representations, both evaluated on ImageNet-100.
  • Figure 5: Linear evaluation of encoders trained by CoCor on (a) MoCo and (b) SimSiam, using composite augmentations of a single length $l$ and a combination of lengths 1, 2, and 3.
  • ...and 2 more figures
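
The strength ordering described in the Figure 3 caption can be illustrated with a small sketch. It rests on several assumptions that are not taken from the paper: a composite augmentation is modelled as a list of basic augmentation names, its composition vector as per-augmentation counts, and "includes all components" as multiset inclusion that ignores application order. The pool `BASIC_AUGS` and the helpers `composition_vector` and `is_stronger` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical pool of basic augmentations; the names are illustrative only.
BASIC_AUGS = ("crop", "flip", "color_jitter", "blur")


def composition_vector(composite):
    """Count how often each basic augmentation occurs in a composite one.

    Assumption: the composition vector is a plain count vector indexed by
    BASIC_AUGS. The paper's Definition 2 is not reproduced in this excerpt.
    """
    counts = Counter(composite)
    unknown = set(counts) - set(BASIC_AUGS)
    if unknown:
        raise ValueError(f"unknown basic augmentations: {sorted(unknown)}")
    return tuple(counts[name] for name in BASIC_AUGS)


def is_stronger(a, b):
    """Return True if composite `a` contains every component of `b` plus more.

    This is a partial order: two composites with disjoint components are
    simply incomparable, and neither is reported as stronger.
    """
    va, vb = composition_vector(a), composition_vector(b)
    return va != vb and all(x >= y for x, y in zip(va, vb))


weak = ["crop"]
strong = ["crop", "blur"]
other = ["flip"]

print(is_stronger(strong, weak))   # strong extends weak with an extra blur
print(is_stronger(weak, other))    # incomparable composites
```

Because the relation is only a partial order, a consistency constraint built on it would apply only to comparable pairs such as `strong` and `weak`, not to unrelated composites such as `weak` and `other`.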

Theorems & Definitions (4)

  • Definition 1: Composite Data Augmentations
  • Definition 2: Composition Vector
  • Definition 3: Latent Similarity
  • Definition 4: Consistency Level
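
As a rough illustration of the monotonic property that Figure 2(b) attributes to the MMNN, the sketch below builds a tiny network whose output never increases when any entry of the composition vector grows. This is not the paper's architecture, which is not given in this excerpt. It uses a common stand-in technique: all weights are forced positive with a softplus, the activations are increasing, and the final score is negated. The class name `MonotoneDecreasingNet`, the hidden size, and the random initialisation are all assumptions, and no training is shown.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


class MonotoneDecreasingNet:
    """One-hidden-layer network that is non-increasing in every input.

    Hypothetical stand-in for an MMNN: raw parameters are unconstrained,
    but each weight passes through softplus before use, so the effective
    weights are positive and the pre-negation score is non-decreasing.
    """

    def __init__(self, n_inputs, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0, 1) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [rng.gauss(0, 1) for _ in range(n_hidden)]
        self.w2 = [rng.gauss(0, 1) for _ in range(n_hidden)]
        self.b2 = rng.gauss(0, 1)

    def __call__(self, v):
        # Positive weights + increasing tanh => hidden units grow with v.
        hidden = [
            math.tanh(sum(softplus(w) * x for w, x in zip(row, v)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        score = sum(softplus(w) * h for w, h in zip(self.w2, hidden)) + self.b2
        # Negate so stronger augmentation gives a smaller output, then
        # squash into (-1, 1), the range of a cosine-style similarity.
        return math.tanh(-score)


g = MonotoneDecreasingNet(n_inputs=4)
weak, medium, strong = (1, 0, 0, 0), (1, 0, 0, 1), (2, 1, 0, 1)
print(g(weak), g(medium), g(strong))  # values do not increase left to right
```

The ordering holds for any parameter values, so monotonicity would be preserved throughout gradient-based training of the raw parameters; only the shape of the mapping, not its direction, would be learned.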