Contrastive Learning for Image Complexity Representation

Shipeng Liu, Liang Zhao, Dengfeng Chen, Zhanping Song

TL;DR

This work tackles the challenge of measuring and leveraging image complexity without costly manual annotations and human biases. It introduces CLIC, a MoCo v2-based contrastive learning framework for unsupervised image complexity representation, augmented by Random Crop and Mix (RCM) to capture multi-scale local and global complexity. Empirical results show CLIC approaches state-of-the-art supervised performance, especially with larger backbones and RCM-enabled data expansion, and it provides practical pipelines to boost object detection and semantic segmentation by injecting IC features as auxiliary priors. The approach offers data-efficient, annotation-free IC modeling with tangible improvements to core CV tasks, highlighting its potential for scalable, IC-aware vision systems.

Abstract

Quantifying and evaluating image complexity can be instrumental in enhancing the performance of various computer vision tasks. Supervised learning can effectively learn image complexity features from well-annotated datasets. However, creating such datasets requires expensive manual annotation, and models trained on them may inherit the annotators' subjective biases. In this work, we adopt the MoCo v2 framework and use contrastive learning to represent image complexity; we name the method CLIC (Contrastive Learning for Image Complexity). We find that complexity differs between local regions of an image, and propose Random Crop and Mix (RCM), which produces positive samples composed of multi-scale local crops. RCM also expands the training set and increases data diversity without introducing additional data. We conduct extensive experiments with CLIC, comparing it with both unsupervised and supervised methods. The results demonstrate that the performance of CLIC is comparable to that of state-of-the-art supervised methods. In addition, we establish pipelines that apply CLIC to computer vision tasks to effectively improve their performance.
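The abstract names MoCo v2 as the underlying framework. As a rough illustration of what a MoCo-style objective looks like, the sketch below computes an InfoNCE loss over one positive key and a queue of negative keys, plus the momentum update of the key encoder. It is a minimal NumPy sketch, not the authors' implementation: the function names, the plain-array "parameters", and the default values (temperature 0.2, momentum 0.999, which are the usual MoCo v2 defaults) are assumptions, and the source does not state which settings CLIC uses.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def info_nce_loss(q, k_pos, queue, temperature=0.2):
    """InfoNCE loss for a batch of queries.

    q:      (N, D) query embeddings
    k_pos:  (N, D) positive key embeddings (another view of the same image)
    queue:  (K, D) negative key embeddings
    temperature: assumed default, not taken from the source
    """
    q, k_pos, queue = l2_normalize(q), l2_normalize(k_pos), l2_normalize(queue)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)   # (N, 1) positive logits
    l_neg = q @ queue.T                                # (N, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    # Cross-entropy with the positive key always at index 0.
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())


def momentum_update(key_params, query_params, m=0.999):
    """Key encoder parameters follow the query encoder as a moving average."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]
```

The loss is small when a query is close to its positive key and far from the queued negatives, which is the property a contrastive representation of image complexity would rely on.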

Paper Structure (18 sections, 4 equations, 8 figures, 7 tables)


Figures (8)

  • Figure 1: Global Entropy of MS COCO 2017 [45].
  • Figure 2: ICD of MS COCO 2017 [45].
  • Figure 3: ICD of PASCAL VOC 2012 [46].
  • Figure 4: ICD of Cornell Grasp [44].
  • Figure 6: Image Random Crop and Mix (RCM). When c is 2 and 3, we obtain 14 and 21 crops, respectively. We then mix 2 original crops of the same size with their 2 transformed crops, finally obtaining 7 and 11 new images, respectively.
  • ...and 3 more figures
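
The Figure 6 caption describes mixing two original crops of the same size with their two transformed counterparts. The sketch below illustrates only that mixing step, under loudly stated assumptions: the crop positions are random, the "transform" is a horizontal flip used as a stand-in, the four tiles are arranged as a 2×2 mosaic, and the `divisor` parameter is hypothetical (it is not claimed to be the caption's c). It does not reproduce the crop counts given in the caption.

```python
import numpy as np


def random_crop(img, size, rng):
    """Cut a size x size window at a random position (assumed sampling rule)."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]


def mix_four(crop_a, crop_b):
    """Tile two same-size crops and their transformed copies into a 2x2 mosaic.

    The horizontal flip is a placeholder; the source does not say which
    transformation is applied.
    """
    if crop_a.shape != crop_b.shape:
        raise ValueError("crops must have the same size")
    top = np.concatenate([crop_a, crop_b], axis=1)                       # originals
    bottom = np.concatenate([crop_a[:, ::-1], crop_b[:, ::-1]], axis=1)  # flipped
    return np.concatenate([top, bottom], axis=0)


def crop_and_mix(img, divisor=2, seed=0):
    """Draw two local crops of side min(H, W) // divisor and mix them."""
    rng = np.random.default_rng(seed)
    size = min(img.shape[:2]) // divisor
    return mix_four(random_crop(img, size, rng), random_crop(img, size, rng))
```

For a 64×64 RGB array and `divisor=2`, `crop_and_mix` returns a 64×64 mosaic of four 32×32 tiles; a smaller crop size yields a smaller mosaic that would need resizing before being fed to an encoder.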