Beyond Random Augmentations: Pretraining with Hard Views

Fabio Ferreira, Ivo Rapant, Jörg K. H. Franke, Frank Hutter

TL;DR

This work introduces Hard View Pretraining (HVP), a learning-free strategy that adversarially selects the hardest pair of augmentations during SSL pretraining to push the model beyond easy views. By sampling $N$ views per image and choosing the pair with maximal loss for each update, HVP consistently improves linear evaluation and transfer performance across SimSiam, DINO, iBOT, and SimCLR, and scales to ImageNet-1k with ConvNets and ViTs. Notably, it sets a new state-of-the-art 78.8% linear accuracy for DINO ViT-B/16 (400 epochs) and yields ~1% average gains across 100/300-epoch pretraining, with robust transfer to downstream tasks like object detection and segmentation. The approach is simple to implement, requires no extra learnable components, and demonstrates resilience to augmentation hyperparameters, offering a practical route to stronger SSL representations for large-scale vision models.

Abstract

Self-Supervised Learning (SSL) methods typically rely on random image augmentations, or views, to make models invariant to different transformations. We hypothesize that the efficacy of pretraining pipelines based on conventional random view sampling can be enhanced by explicitly selecting views that benefit the learning progress. A simple yet effective approach is to select hard views that yield a higher loss. In this paper, we propose Hard View Pretraining (HVP), a learning-free strategy that extends random view generation by exposing models to more challenging samples during SSL pretraining. HVP encompasses the following iterative steps: 1) randomly sample multiple views and forward each view through the pretrained model, 2) create pairs of two views and compute their loss, 3) adversarially select the pair yielding the highest loss according to the current model state, and 4) perform a backward pass with the selected pair. In contrast to existing hard view literature, we are the first to demonstrate hard view pretraining's effectiveness at scale, particularly training on the full ImageNet-1k dataset, and evaluating across multiple SSL methods, ConvNets, and ViTs. As a result, HVP sets a new state-of-the-art on DINO ViT-B/16, reaching 78.8% linear evaluation accuracy (a 0.6% improvement) and consistent gains of 1% for both 100 and 300 epoch pretraining, with similar improvements across transfer tasks in DINO, SimSiam, iBOT, and SimCLR.
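The four steps in the abstract can be illustrated with a minimal, pure-Python sketch of the selection step. It is not the authors' implementation: the function names `negative_cosine` and `select_hardest_pair` are hypothetical, the negative cosine similarity is only a stand-in for whatever pairwise SSL objective is in use, and the toy 2-D vectors stand in for embeddings that a real encoder would produce from $N$ sampled views.

```python
import itertools
import math


def negative_cosine(u, v):
    """Stand-in pairwise loss: negative cosine similarity (higher = harder pair)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return -dot / (norm_u * norm_v)


def select_hardest_pair(embeddings, pair_loss=negative_cosine):
    """Return (i, j, loss) for the pair of views with the highest loss.

    `embeddings` holds one vector per sampled view of the same image
    (step 1). All unordered pairs are scored (step 2) and the pair with
    the maximal loss under the current model state is returned (step 3).
    """
    best = None
    for i, j in itertools.combinations(range(len(embeddings)), 2):
        loss = pair_loss(embeddings[i], embeddings[j])
        if best is None or loss > best[2]:
            best = (i, j, loss)
    return best


# Toy embeddings for N = 4 views of one image (assumed values, for illustration only).
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
i, j, loss = select_hardest_pair(views)
print(f"hardest pair: views {i} and {j}")
```

In a real training loop, the backward pass of step 4 would then be performed using only the selected pair.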

Paper Structure (49 sections, 6 equations, 10 figures, 14 tables, 1 algorithm)

Figures (10)

  • Figure 1: (a) HVP first samples $N$ views, pairs them, and adversarially selects the hardest pair, i.e., the one with the worst loss according to the current model state. (b) Examples (left) and sampled views (right) after transformations. Hard pairs selected by HVP are shown with a solid frame.
  • Figure 2: Left: In over 40% of the cases, the adversarially selected view pair also has the lowest Intersection over Union (IoU) throughout SimSiam+HVP pretraining. We attribute the early spike to the random initialization of the embedding. Right: HVP (blue) shows a shift toward smaller IoU values compared with standard pretraining (orange). Both results are based on 3 seeds.
  • Figure 3: Left: The average IoU of view pairs selected by SimSiam+HVP (blue) compared against the default SimSiam training (green). Right: Using static color augmentation for all pairs before the selection increases the dependency on the IoU.
  • Figure 4: With HVP, SimSiam appears more robust to augmentation hyperparameter variation. We show this for RandomResizedCrop (RRC, left) and ColorJitter (CJ, right). For RRC, the values indicate the lower bound of the sampling range; for CJ, they indicate the intensity of the color cues. Results are averaged over two seeds; the SimSiam defaults are RRC=0.2 and CJ=0.5.
  • Figure 5: Each row shows one of ten example images from the ImageNet train set along with its sampled views, evaluated with a fully trained (100-epoch) SimSiam ResNet50 model. Left: original image with the randomly sampled crops overlaid (colored dashed rectangles). Right: All views after applying resizing and appearance augmentations. The pair that is selected adversarially by HVP is highlighted with solid lines, e.g., View 1 and View 4 in the first row.
  • ...and 5 more figures
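
The Intersection over Union (IoU) referred to in the captions of Figures 2 and 3 measures how much two crops overlap. A minimal sketch follows, assuming each crop is an axis-aligned box given as `(x0, y0, x1, y1)`; the function name `box_iou` and the box format are illustrative assumptions, not taken from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes, each given as (x0, y0, x1, y1)."""
    # Corners of the intersection rectangle (may be empty).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 crops that share a 1x1 corner: IoU = 1 / (4 + 4 - 1).
overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A lower IoU means the two crops share less of the image, which is consistent with the captions' observation that adversarially selected pairs tend toward smaller IoU values.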