Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey

Longlong Jing, Yingli Tian

TL;DR

This survey analyzes self-supervised visual feature learning with deep ConvNets, focusing on learning from unlabeled data via pretext tasks and transferring to downstream vision tasks. It organizes methods by architecture, pretext-task categories (generation-based, context-based, free semantic label-based, cross-modal), datasets, and applications in both image and video domains, and provides quantitative comparisons across benchmarks. The findings show that image-based self-supervised features can closely approach supervised performance on some tasks (notably detection and segmentation) while video self-supervision trails behind, partly due to model and data scale challenges. The paper highlights practical implications for scalable pretraining, discusses reproducibility concerns, and outlines future directions such as synthetic/web data, spatiotemporal learning, cross-modal supervision, and multi-task pretraining to further close the gap to supervised learning. Overall, self-supervised learning emerges as a promising paradigm for leveraging vast unlabeled visual data to obtain transferable representations with real-world impact in recognition, localization, and beyond.

Abstract

Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, are proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminologies of this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the main components and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used image and video datasets and the existing self-supervised visual feature learning methods. Then, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning. Finally, the paper concludes with a set of promising future directions for self-supervised visual feature learning.

Paper Structure

This paper contains 61 sections, 5 equations, 24 figures, 6 tables.

Figures (24)

  • Figure 1: The general pipeline of self-supervised learning. The visual feature is learned by training ConvNets to solve a pre-defined pretext task. After self-supervised pretext task training is finished, the learned parameters serve as a pre-trained model and are transferred to other downstream computer vision tasks by fine-tuning. The performance on these downstream tasks is used to evaluate the quality of the learned features. During the knowledge transfer for downstream tasks, usually only the general features from the first several layers are transferred.
  • Figure 2: The architecture of AlexNet. The numbers indicate the number of channels of each feature map. Figure is reproduced based on AlexNet.
  • Figure 3: The architecture of VGG. Figure is reproduced based on VGG.
  • Figure 4: The architecture of the residual block (ResNet). The identity mapping can effectively reduce gradient vanishing and explosion, which makes the training of very deep networks feasible. Figure is reproduced based on ResNet.
  • Figure 5: The architecture of the Inception block (GoogLeNet). Figure is reproduced based on GoogLeNet.
  • ...and 19 more figures
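
The pipeline in Figure 1 depends on a pretext task whose labels come from the data itself. The following is a minimal sketch of that label-generation step, assuming rotation prediction as the pretext task. That choice, the nested-list image format, and the helper names `rotate90` and `make_rotation_pretext_batch` are illustrative assumptions and are not taken from the text above. No ConvNet is trained here; the sketch only shows how (input, pseudo-label) pairs can be produced without human annotation.

```python
import random


def rotate90(image):
    """Rotate a 2D list-of-lists image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_pretext_batch(images, rng):
    """Turn unlabeled images into (rotated_image, pseudo_label) pairs.

    The pseudo label k in {0, 1, 2, 3} is the number of 90-degree
    clockwise rotations applied. It is generated automatically, so
    no human-annotated label is needed (hypothetical helper).
    """
    batch = []
    for image in images:
        k = rng.randrange(4)                      # pick a rotation class
        rotated = [list(row) for row in image]    # copy, leave input intact
        for _ in range(k):
            rotated = rotate90(rotated)
        batch.append((rotated, k))
    return batch


if __name__ == "__main__":
    unlabeled = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    for rotated, label in make_rotation_pretext_batch(unlabeled, random.Random(0)):
        print(label, rotated)
```

In a full pipeline, a network would be trained to predict the pseudo label from the rotated input, and its early layers would then be fine-tuned on a downstream task as the Figure 1 caption describes.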