Analysis of Spatial augmentation in Self-supervised models in the purview of training and test distributions

Abhishek Jha, Tinne Tuytelaars

TL;DR

A distance-based margin is proposed for the invariance loss when learning representations from scene-centric images for downstream tasks on an object-centric distribution. Even a simple margin proportional to the pixel distance between the two spatial views in a scene-centric image is shown to improve the learned representation.

Abstract

In this paper, we present an empirical study of two typical spatial augmentation techniques used in self-supervised representation learning methods (both contrastive and non-contrastive), namely random crop and cutout. Our contributions are: (a) we dissociate random cropping into two separate augmentations, overlap and patch, and provide a detailed analysis of the effect of the area of overlap and the patch size on the accuracy of downstream tasks. (b) We offer an insight into why cutout augmentation does not lead to good representations, as reported in earlier literature. Finally, based on these analyses, (c) we propose a distance-based margin for the invariance loss when learning scene-centric representations for downstream tasks on an object-centric distribution, showing that even a simple margin proportional to the pixel distance between the two spatial views in a scene-centric image can improve the learned representation. Our study furthers the understanding of spatial augmentations and of the effect of the domain gap between the training augmentations and the test distribution.
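To make contribution (c) more concrete, the following is a minimal sketch of one way a margin proportional to the pixel distance between two views could be attached to a cosine-based invariance loss. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the hinge form, the normalization by the image diagonal, the scale `alpha`, and the function names are all hypothetical and should not be read as the authors' method.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (N, D) arrays."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.sum(a * b, axis=-1)


def distance_margin_invariance_loss(z1, z2, centers1, centers2,
                                    image_diag, alpha=0.2):
    """Hinge-style invariance loss with a distance-dependent margin.

    ASSUMPTION: the margin grows linearly with the pixel distance between
    the two view centers, scaled by `alpha` and normalized by the image
    diagonal. This is an illustrative choice, not the paper's equation.

    z1, z2:             (N, D) embeddings of the two views
    centers1, centers2: (N, 2) pixel coordinates of the view centers
    image_diag:         length of the image diagonal in pixels
    """
    pixel_dist = np.linalg.norm(centers1 - centers2, axis=-1)
    margin = alpha * pixel_dist / image_diag
    dissimilarity = 1.0 - cosine_similarity(z1, z2)
    # Views that are far apart are allowed to differ by up to `margin`
    # before they contribute to the loss.
    return np.maximum(0.0, dissimilarity - margin).mean()
```

With this form, two views sampled at the same location are pulled together as in a plain invariance loss, while distant views, which in a scene-centric image are more likely to contain different objects, are penalized less for disagreeing.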

Paper Structure

This paper contains 7 sections, 1 equation, and 5 figures.

Figures (5)

  • Figure 1: Spatial augmentation: a) shows the overlap and patch augmentation schemes; the red and blue rectangles show the sampling regions for the two augmented views. b) shows cutout augmentation and our proposed cutout-blur augmentation. c) shows an example of a scene-centric image with multiple distinct semantic concepts. Minimizing the invariance loss between views containing distinct concepts results in a noisy object-specific representation. To overcome this, we propose an invariance loss conditioned on the distance between the two views. This distance-based margin relaxes the invariance criterion between the patches based on their inter-view pixel distance.
  • Figure 2: Analysis of random cropping augmentation: Evaluation of a) overlap augmentation and b) patch augmentation on STL10 [coates2011analysis], CIFAR10 and CIFAR100 [krizhevsky2009learning] (after 400 epochs), and ImageNet100 [deng2009imagenet] (after 200 epochs). c) Comparison of the best-performing overlap models with the best-performing patch models. The number on top of each bar denotes the overlap or patch size corresponding to the best model.
  • Figure 3: Mutually exclusive crops on STL10 and CIFAR10: CIFAR10 shows a weak inverse correlation between accuracy and the area of the exclusive region, while STL10 shows no clear pattern, both measured at epoch 400.
  • Figure 4: Comparison of cutout augmentation with cutout-blur augmentation: Cutout-blur consistently outperforms cutout augmentation across different cutout sizes evaluated at the same epoch ($=100$). We do not include cutout sizes for which the blurring kernel is larger than the cutout region.
  • Figure 5: kNN evaluation on the object-centric CIFAR10 dataset [krizhevsky2009learning], comparing SimSiam trained on the scene-centric MSCOCO dataset [lin2014microsoft] with no margin (vanilla), a fixed margin ($=0.2$), and the distance-based margin.
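
The captions for Figures 1(b) and 4 contrast cutout with cutout-blur but do not spell out the operations. The sketch below shows one plausible reading, in which cutout replaces a square region with a constant value while cutout-blur replaces the same region with a blurred copy of itself. This reading is an assumption, as are the box (mean) filter, the helper names, and the default kernel size; the only detail taken from the captions is that the blurring kernel should not be larger than the cutout region.

```python
import numpy as np


def cutout(image, top, left, size, fill=0.0):
    """Replace a size x size square of an (H, W, C) image with `fill`."""
    out = image.copy()
    out[top:top + size, left:left + size] = fill
    return out


def box_blur(image, kernel):
    """Mean filter over a kernel x kernel window with edge padding.

    ASSUMPTION: a box filter stands in for whatever blur the paper uses.
    `kernel` is expected to be odd so the window is centered.
    """
    pad = kernel // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    height, width = image.shape[:2]
    out = np.zeros(image.shape, dtype=np.float64)
    for dy in range(kernel):
        for dx in range(kernel):
            out += padded[dy:dy + height, dx:dx + width]
    return out / (kernel * kernel)


def cutout_blur(image, top, left, size, kernel=3):
    """Replace a size x size square with its blurred version (hypothetical)."""
    if kernel > size:
        # Mirrors the caption: such configurations are excluded.
        raise ValueError("blurring kernel is larger than the cutout region")
    blurred = box_blur(image, kernel)
    out = image.astype(np.float64).copy()
    out[top:top + size, left:left + size] = \
        blurred[top:top + size, left:left + size]
    return out
```

Under this reading, cutout-blur keeps low-frequency content inside the masked region instead of discarding it entirely, which is one way to think about the comparison in Figure 4.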