Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis

Imanol G. Estepa, Jesús M. Rodríguez-de-Vera, Ignacio Sarasúa, Bhalaji Nagarajan, Petia Radeva

TL;DR

Sorcen addresses the challenge of unifying representation learning and image synthesis in self-supervised learning by operating on precomputed semantic tokens and introducing Echo Contrast, which generates positive samples from the model's own reconstruction. It couples a semantic reconstruction objective with a contrastive objective via a teacher-student EMA framework to achieve strong discriminative and generative performance without online tokenization or heavy augmentations. Experiments on ImageNet-1k show Sorcen achieving state-of-the-art results across linear probing, unconditional generation, few-shot, and transfer learning, while providing substantial efficiency gains (~60% fewer GPU-hours) relative to prior unified SSL methods like MAGE. This work advances unified SSL by delivering a disk-efficient approach that balances generation and recognition and opens avenues for extending to additional semantic token spaces.

Abstract

While representation learning and generative modeling both seek to understand visual data, unifying the two domains remains underexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they rely solely on semantic token reconstruction, which requires an external tokenizer during training -- introducing a significant overhead. In this work, we introduce Sorcen, a novel unified SSL framework, incorporating a synergistic Contrastive-Reconstruction objective. Our Contrastive objective, "Echo Contrast", leverages the generative capabilities of Sorcen, eliminating the need for additional image crops or augmentations during training. Sorcen "generates" an echo sample in the semantic token space, forming the contrastive positive pair. Sorcen operates exclusively on precomputed tokens, eliminating the need for an online token transformation during training, thereby significantly reducing computational overhead. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Additionally, Sorcen surpasses previous single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.
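
As a rough illustration of how an echo sample can act as the contrastive positive, the sketch below pairs each student (anchor) embedding with the teacher embedding of its echo and scores the pair with an InfoNCE-style loss. It also shows the exponential-moving-average (EMA) teacher update mentioned in the TL;DR. The loss form, the use of other echoes in the batch as negatives, the temperature, the momentum value, and all function names are assumptions made for illustration; this page does not state the exact loss Sorcen uses.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the student weight.

    The momentum value is a hypothetical default, not taken from the paper.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def echo_contrast_loss(anchors, echoes, temperature=0.2):
    """InfoNCE-style loss (assumed form).

    anchors[i] is the student embedding of input i, echoes[i] the teacher
    embedding of its echo. The matching echo is the positive; the other
    echoes in the batch are treated as negatives.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, echo) / temperature for echo in echoes]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    aligned = echo_contrast_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
    swapped = echo_contrast_loss(anchors, [[0.1, 0.9], [0.9, 0.1]])
    print(aligned < swapped)  # echoes that match their anchors give the lower loss
```

Because the positive comes from the model's own reconstruction rather than from a second augmented crop, a loss of this kind needs only one input per image, which is the property the abstract highlights.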

Paper Structure

This paper contains 19 sections, 3 equations, 15 figures, 16 tables.

Figures (15)

  • Figure 1: Simplified Echo Contrast. For a given anchor input, Sorcen's decoder outputs a set of logits, which are filtered to create a logit distribution from which diverse Echo tokens are sampled. These tokens are processed by a Teacher encoder before being contrasted against the anchor. Echo Contrast enables single-input contrastive learning, removing the need for pixel augmentations.
  • Figure 2: Echo sample visualization. Images on top correspond to decoded original token inputs that serve as Anchors for the Echo samples (bottom). Echoes are extracted during training and decoded in a single decoding step for visualization purposes.
  • Figure 3: Left. Sorcen combines two objectives: (1) Semantic Reconstruction and (2) Echo Contrast. The Semantic Reconstruction objective masks (M) and drops (D) the input tokens before they are processed by a Student encoder. The Decoder is trained to predict the original input. From this prediction, Sorcen extracts the logits and uses them to sample Echoes, which are the positive samples in the Echo Contrast objective. These Echoes are processed by a Teacher encoder, and the loss is computed between the outputs of the Student and Teacher encoders. Right. Sorcen uses an extra token to produce a learnable placeholder embedding for all masked and dropped tokens during the Mask&Drop phase. This ensures equal token counts for Decoder input and output, which is crucial for Semantic Reconstruction.
  • Figure 3: Transfer learning results (top-1 accuracy) for different datasets under 16-shot settings. The last column contains the average across datasets.
  • Figure 4: JSM visualization. Each cell represents a VQGAN semantic token. Gray cells are the "cropped-out" area.
  • ...and 10 more figures
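
The captions for Figures 1 and 3 describe two mechanical steps that a small sketch may make more concrete: sampling Echo tokens from filtered decoder logits, and the Mask&Drop phase with a placeholder that restores the full sequence length for the decoder. The code below is a toy version under stated assumptions: top-k is used as the filtering rule (the captions only say the logits are "filtered"), dropped tokens are taken to be a subset of the masked ones, the encoder is omitted, and `MASK_ID`, `PLACEHOLDER`, and all function names are hypothetical.

```python
import math
import random

MASK_ID = -1          # hypothetical id marking a masked token
PLACEHOLDER = "<ph>"  # stands in for the learnable placeholder embedding


def sample_echo(logits_per_position, top_k=3, rng=None):
    """Sample one token id per position from top-k filtered logits."""
    rng = rng or random.Random()
    echo = []
    for logits in logits_per_position:
        # Assumed filtering rule: keep only the top_k highest-scoring ids.
        kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
        max_logit = logits[kept[0]]
        # Softmax weights over the kept ids (unnormalised is fine for choices()).
        weights = [math.exp(logits[i] - max_logit) for i in kept]
        echo.append(rng.choices(kept, weights=weights, k=1)[0])
    return echo


def mask_and_drop(tokens, mask_ratio, drop_ratio, rng):
    """Return (encoder_input, masked_positions).

    encoder_input is a list of (position, token_id) pairs. Dropped positions
    are absent; masked-but-kept positions carry MASK_ID.
    """
    n = len(tokens)
    masked = rng.sample(range(n), int(n * mask_ratio))
    # Assumption: the dropped tokens are a subset of the masked ones.
    dropped = set(masked[: int(len(masked) * drop_ratio)])
    masked = set(masked)
    encoder_input = [
        (pos, MASK_ID if pos in masked else tok)
        for pos, tok in enumerate(tokens)
        if pos not in dropped
    ]
    return encoder_input, masked


def decoder_input(encoder_output, masked, length):
    """Rebuild a full-length sequence for the decoder.

    Every masked or dropped position receives the placeholder, so the decoder
    sees as many tokens as it must predict.
    """
    by_pos = dict(encoder_output)
    return [PLACEHOLDER if pos in masked else by_pos[pos] for pos in range(length)]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(100, 116))  # 16 toy token ids
    enc, masked = mask_and_drop(tokens, mask_ratio=0.5, drop_ratio=0.5, rng=rng)
    dec = decoder_input(enc, masked, len(tokens))  # encoder skipped in this sketch
    print(len(enc), len(dec))
```

In this toy setup the encoder input is shorter than the original sequence because dropped tokens are removed, while the decoder input is padded back to the original length, which mirrors the "equal token counts" point in the Figure 3 caption.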