Deep Fusion: Capturing Dependencies in Contrastive Learning via Transformer Projection Heads

Huanran Li, Daniel Pimentel-Alarcón

TL;DR

This work investigates replacing the standard feed-forward projection head in contrastive learning with a Transformer-based projection head to capture long-range dependencies among embeddings. It introduces Deep Fusion, an unsupervised phenomenon where attention layers progressively group samples from the same class, and provides a theoretical framework for this behavior. Empirically, Transformer projection heads yield improvements over FFN heads across CIFAR-10/100 and ImageNet-200, with notable gains in both supervised and unsupervised evaluations and with ablations clarifying optimal batch size, temperature, and weight decay. The results suggest that attention-based projection heads are a promising direction for enhancing self-supervised representation learning in vision.

Abstract

Contrastive Learning (CL) has emerged as a powerful method for training feature extraction models using unlabeled data. Recent studies suggest that incorporating a linear projection head post-backbone significantly enhances model performance. In this work, we investigate the use of a transformer model as a projection head within the CL framework, aiming to exploit the transformer's capacity for capturing long-range dependencies across embeddings to further improve performance. Our key contributions are fourfold: First, we introduce a novel application of transformers in the projection head role for contrastive learning, marking the first endeavor of its kind. Second, our experiments reveal a compelling "Deep Fusion" phenomenon where the attention mechanism progressively captures the correct relational dependencies among samples from the same class in deeper layers. Third, we provide a theoretical framework that explains and supports this "Deep Fusion" behavior. Finally, we demonstrate through experimental results that our model achieves superior performance compared to the existing approach of using a feed-forward layer.

Paper Structure

This paper contains 12 sections, 2 theorems, 27 equations, 1 figure, 2 tables, 1 algorithm.

Key Result

Theorem 1

Given a collection of input samples $\mathbf{X}$, where every sample $\mathbf{x}_i$ comes from one of the subspaces $\{\mathcal{U}_1, \dots, \mathcal{U}_\mathcal{K}\}$, there always exists a pair of parameters $(\mathbf{W}^{Q*}, \mathbf{W}^{K*})$ such that the attention matrix $\mathbf{A}$ calculated b where $\nu_i$ is the number of samples in the same cluster as $\mathbf{x}_i$, and $\rho$ is the Class I

Figures (1)

  • Figure 1: Deep Fusion within the Transformer Projection Head for Contrastive Learning. Embeddings from the backbone network are transformed into a sequence. As the process unfolds, the attention mechanism progressively identifies and amplifies relational dependencies among samples of the same class in deeper layers. This indicates an unsupervised 'fusion' of samples, drawing them closer to each other based on class similarity, without the need for explicit label supervision.
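
The mechanism in the Figure 1 caption, attention applied across the samples of a batch rather than across tokens of one sample, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention head with $\mathbf{W}^Q = \mathbf{W}^K$ set to a scaled identity and $\mathbf{W}^V$ set to the identity, and the helper names (`batch_attention`, `fuse`), the `scale` value, and the four 2-D embeddings are all hypothetical choices made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def batch_attention(X, scale=1.0):
    """Attention matrix over a batch treated as a sequence.

    Assumption: W^Q and W^K are both sqrt(scale) * I, so the logit
    between samples i and j is scale * <x_i, x_j>.
    """
    logits = [[scale * sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]
    return [softmax(row) for row in logits]


def fuse(A, X):
    """One attention step with W^V = I (assumption): each output row
    is the attention-weighted mean of all samples in the batch."""
    n, dim = len(X), len(X[0])
    return [[sum(A[i][j] * X[j][d] for j in range(n)) for d in range(dim)]
            for i in range(n)]


# Toy batch: two samples near each of two orthogonal directions.
# The labels are used only to inspect the result, never by the attention step.
X = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
labels = [0, 0, 1, 1]

A = batch_attention(X, scale=5.0)

# Fraction of each sample's attention that lands on its own class.
same_class_mass = [
    sum(A[i][j] for j in range(len(X)) if labels[j] == labels[i])
    for i in range(len(X))
]

# After one step, same-class samples sit closer together than before.
Z = fuse(A, X)
```

In this toy setting most of each row of `A` falls on samples from the same class, so `fuse` pulls same-class embeddings toward each other without using labels. Stacking several such steps is one way to picture the progressive grouping the caption describes, though the paper's learned heads need not behave like this hand-picked parameter choice.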

Theorems & Definitions (6)

  • Definition 1
  • Theorem 1
  • proof
  • Definition 2
  • Theorem 2
  • proof