GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning

Xiaojie Li, Bei Wang, Jianlong Wu, Yue Yu, Liqiang Nie, Min Zhang

TL;DR

GenView++ tackles two central challenges in contrastive representation learning: the limited diversity and potential semantic drift of augmentations, and the lack of pair-level quality assessment during training. It introduces a multi-source adaptive view generation module (image-conditioned, text-conditioned, and image-text-conditioned) that offline synthesizes diverse, semantically aligned views, and a quality-driven contrastive learning mechanism that online reweights positive pairs based on semantic alignment and diversity. The framework is model-agnostic and yields consistent gains across vision and vision-language benchmarks, including notable improvements on ImageNet linear classification (+2.5% with MoCov2) and zero-shot cross-modal tasks (+12.31% over CLIP on average). These results demonstrate that high-quality, controllably generated positives can substantially improve cross-modal alignment and data efficiency, offering a scalable data-centric path for future representation learning.

Abstract

The success of contrastive learning depends on the construction and utilization of high-quality positive pairs. However, current methods face critical limitations on two fronts: on the construction side, both handcrafted and generative augmentations often suffer from limited diversity and risk semantic corruption; on the learning side, the absence of a quality assessment mechanism leads to suboptimal supervision in which all pairs are treated equally. To tackle these challenges, we propose GenView++, a unified framework that addresses both fronts with two synergistic innovations. First, to improve pair construction, GenView++ introduces a multi-source adaptive view generation mechanism that synthesizes diverse yet semantically coherent views by dynamically modulating generative parameters across image-conditioned, text-conditioned, and image-text-conditioned strategies. Second, to improve pair utilization, a quality-driven contrastive learning mechanism assesses each pair's semantic alignment and diversity and dynamically reweights its training contribution, prioritizing high-quality pairs while suppressing redundant or misaligned ones. Extensive experiments demonstrate the effectiveness of GenView++ across both vision and vision-language tasks. For vision representation learning, it improves MoCov2 by +2.5% on ImageNet linear classification. For vision-language learning, it raises the average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and further improves Flickr30k text retrieval R@5 by +3.2%.
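
To make the reweighting idea concrete, the following is a minimal sketch of a quality-weighted InfoNCE loss. It is not the paper's formulation, which is not given in this section. The weighting rule (a softmax over the sum of per-pair alignment and diversity scores, rescaled to average 1), the function names, and the temperature value are all illustrative assumptions.

```python
import math


def info_nce_per_pair(sim, temperature=0.2):
    """Per-anchor InfoNCE losses from a square similarity matrix.

    sim[i][j] is the similarity between anchor i and candidate j;
    candidate i is the positive for anchor i, all others are negatives.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return losses


def quality_weights(alignment, diversity):
    """Hypothetical weighting rule (an assumption, not the paper's):
    softmax over alignment + diversity, rescaled so the mean weight is 1."""
    scores = [a + d for a, d in zip(alignment, diversity)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    n = len(scores)
    return [n * e / total for e in exps]


def quality_weighted_loss(sim, alignment, diversity, temperature=0.2):
    """Mean of per-pair InfoNCE losses, each scaled by its quality weight."""
    losses = info_nce_per_pair(sim, temperature)
    weights = quality_weights(alignment, diversity)
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

With equal quality scores, every weight is 1 and the loss reduces to standard InfoNCE. Pairs with higher combined alignment and diversity contribute more to the gradient, while low-scoring pairs are down-weighted rather than discarded.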

Paper Structure

This paper contains 48 sections, 22 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Motivation. (a) Standard contrastive learning methods rely on handcrafted augmentations, which provide limited diversity, risk semantic distortions, and lack pair-level quality control during training. (b) GenView++ addresses these limitations via (i) Multi-Source Adaptive View Generation, which synthesizes diverse and semantically aligned views, and (ii) Quality-Driven Contrastive Learning, which reweights supervision by assessing the quality of both image–image and image–text pairs.
  • Figure 2: Overview of GenView++. (a) Offline Multi-Source Adaptive View Generation synthesizes diverse views using three strategies: image-conditioned (IC) generation adjusts the noise level $\ell$ by foreground saliency; text-conditioned (TC) generation adjusts the guidance scale $g$ by caption complexity; image-text-conditioned (ITC) generation adapts both. (b) Online Quality-Driven Contrastive Learning dynamically reweights pairs: high-quality pairs (high alignment, high diversity) receive higher weights $w_i$, while noisy pairs are suppressed.
  • Figure 3: t-SNE visualization of image features from 9 randomly selected CIFAR-100 classes. (a) Baseline with standard augmentations; (b) model with multi-source adaptive view generation; (c) full GenView++ with the additional quality-driven contrastive loss.
  • Figure 4: Image-Conditioned Adaptive Generation. The noise level $\ell$ is adapted to the image’s foreground proportion. (a) For low-foreground images, a low $\ell$ (green boxes) avoids semantic drift, object loss, or distortion (Col 1–3). (b) For high-foreground images, a high $\ell$ (blue boxes) enriches diversity, e.g., varying pose (Col 4), action (Col 5), and background (Col 6).
  • Figure 5: Text-Conditioned Adaptive Generation. The guidance scale $g$ is adjusted by caption complexity. (a) Detailed captions with high visual complexity: a low $g$ (green boxes) encourages diverse generations while preserving fine semantics. In contrast, a high $g$ over-constrains the output, yielding repetitive or rigid results. (b) Simple captions with low visual complexity: a high $g$ (blue boxes) strengthens text-image alignment and ensures key concepts are retained.
  • ...and 1 more figures
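
The adaptive behavior described in the captions of Figures 2, 4, and 5 can be sketched as a simple mapping from image and caption statistics to generation parameters. This is a sketch under stated assumptions: the inputs are taken to be scores normalized to [0, 1], the interpolation is linear, and the parameter ranges and function names are hypothetical. This section does not say how foreground proportion or caption complexity is measured, or whether the parameters vary continuously or in discrete levels.

```python
def lerp(start, end, t):
    """Linear interpolation from start to end, with t clamped to [0, 1]."""
    t = min(1.0, max(0.0, t))
    return start + (end - start) * t


def adaptive_noise_level(foreground_ratio, low=0.1, high=0.5):
    """IC sketch: a larger foreground proportion allows a higher noise level.

    The range [low, high] is a hypothetical choice, not taken from the paper.
    """
    return lerp(low, high, foreground_ratio)


def adaptive_guidance_scale(caption_complexity, low=2.0, high=10.0):
    """TC sketch: a more detailed caption gets a lower guidance scale.

    The range [low, high] is a hypothetical choice, not taken from the paper.
    """
    return lerp(high, low, caption_complexity)


def itc_parameters(foreground_ratio, caption_complexity):
    """ITC sketch: adapt both the noise level and the guidance scale."""
    return (
        adaptive_noise_level(foreground_ratio),
        adaptive_guidance_scale(caption_complexity),
    )
```

Under these assumptions, a low-foreground image receives a small noise level so the object is less likely to be lost, and a simple caption receives a large guidance scale so its key concepts are retained, matching the directions stated in the captions.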