CLIC: Contrastive Learning Framework for Unsupervised Image Complexity Representation

Shipeng Liu, Liang Zhao, Dengfeng Chen

TL;DR

CLIC quantifies image complexity (IC) without manual labels by learning IC representations in a dual-encoder contrastive framework. It introduces a complexity-aware loss guided by a global entropy prior and a positive/negative sampling strategy tailored to IC, which together enable unsupervised learning from large unlabeled corpora and effective fine-tuning on IC9600 with only a small labeled subset. Ablation and downstream-task results show that CLIC yields strong IC representations, improves performance on detection and segmentation tasks, and reduces reliance on subjective human annotations. The work offers a scalable approach to perceptual image complexity that is free of annotator bias and can enhance a range of CV pipelines.

Abstract

As a fundamental visual attribute, image complexity significantly influences both human perception and the performance of computer vision models. However, accurately assessing and quantifying image complexity remains a challenging task. (1) Traditional metrics such as information entropy and compression ratio often yield coarse and unreliable estimates. (2) Data-driven methods require expensive manual annotations and are inevitably affected by human subjective biases. To address these issues, we propose CLIC, an unsupervised framework based on Contrastive Learning for learning Image Complexity representations. CLIC learns complexity-aware features from unlabeled data, thereby eliminating the need for costly labeling. Specifically, we design a novel positive and negative sample selection strategy to enhance the discrimination of complexity features. Additionally, we introduce a complexity-aware loss function guided by image priors to further constrain the learning process. Extensive experiments validate the effectiveness of CLIC in capturing image complexity. When fine-tuned with a small number of labeled samples from IC9600, CLIC achieves performance competitive with supervised methods. Moreover, applying CLIC to downstream tasks consistently improves performance. Notably, both the pretraining and application processes of CLIC are free from subjective bias.
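To make point (1) concrete, here is a minimal sketch of the two traditional metrics the abstract names. It is an illustration under stated assumptions, not the paper's formulation: `histogram_entropy` and `compression_ratio` are hypothetical helper names, the entropy is the plain Shannon entropy of an 8-bit intensity histogram (the paper's own global entropy is defined by its Eq. (2), which is not reproduced here), and zlib stands in for whichever compressor a given study uses.

```python
import math
import zlib
from collections import Counter


def histogram_entropy(pixels):
    """Shannon entropy in bits of a sequence of 8-bit intensity values."""
    counts = Counter(pixels)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def compression_ratio(pixels):
    """Compressed size divided by raw size, using zlib as an assumed codec."""
    raw = bytes(pixels)
    return len(zlib.compress(raw, 9)) / len(raw)


# Two toy "images" flattened to 4096 intensity values each.
flat = [128] * 4096                                  # uniform gray patch
periodic = [(i * 97 + 13) % 256 for i in range(4096)]  # repeating 256-value pattern

for name, img in (("flat", flat), ("periodic", periodic)):
    print(name, histogram_entropy(img), compression_ratio(img))
```

The periodic pattern uses every intensity equally often, so its histogram entropy is maximal even though the signal is highly regular and compresses well. Histogram entropy ignores spatial arrangement entirely, which is one reason such metrics give only a coarse estimate of complexity.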

Paper Structure

This paper contains 29 sections, 10 equations, 15 figures, 11 tables.

Figures (15)

  • Figure 1: The CLIC framework comprises a query encoder and a key encoder that share the same architecture. The parameters of the query encoder are updated via standard backpropagation, while the key encoder is updated using momentum-based updates. The global entropy (ge) of each image in a mini-batch is pre-computed as prior information. To guide the model towards learning intrinsic image features rather than category- or object-specific attributes, we extract the feature activation energy (fae) from the last down-sampling layer (or the final stage of the encoder). This fae is then combined with the image's ge to compute the Complexity-Aware Loss, which encourages the model to focus on structural and informational content that reflects image complexity rather than on semantic content.
  • Figure 1: CLIC fine-tuning pipeline.
  • Figure 2: Image Complexity Distribution of Datasets. (a) Global entropy of MS COCO [mscoco]. (b) ICD of MS COCO [mscoco]. (c) ICD of PASCAL VOC [46]. (d) ICD of Cornell Grasp [44]. The global entropy is obtained by Eq. (2). ICD stands for image complexity distribution, which represents the true complexity of each image. In this work, it is obtained by computing statistics over the per-image predictions of a well-trained ICNet.
  • Figure 3: Our overall architecture. (a) Detailed structure of CLIC. (b) Positive and negative sample selection. We crop each mini-batch image to produce views whose image complexity is close ($\sim$ IC score) to that of the original; these views serve as positive samples. Samples outside the mini-batch serve as negative samples.
  • Figure 4: Small-scale crop and merge. Given an image of size $(h,w)$, the crop sizes are $(h,w)/c$, $(h,w)/2c$, and $(h,w)/4c$, and the corresponding numbers of crops per image are $c$, $2c$, and $4c$, respectively. We then randomly merge two crops of the same size together with their two transformed crops.
  • ...and 10 more figures
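Two mechanisms from the captions above lend themselves to a short sketch: the momentum-based update of the key encoder (Figure 1) and the crop sizes and counts (Figure 4). The code below is a simplified illustration, not the authors' implementation. The function names are hypothetical, parameters are modeled as flat lists of floats rather than network tensors, the momentum coefficient `m = 0.999` is an assumed value that the captions do not state, and rounding crop sizes down with integer division is likewise an assumption.

```python
def momentum_update(query_params, key_params, m=0.999):
    """One exponential-moving-average step of the key encoder toward the query encoder.

    The query encoder itself is trained by backpropagation; only the key
    encoder is updated this way. m = 0.999 is an assumed default.
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def crop_schedule(h, w, c):
    """Return (crop_height, crop_width, crops_per_image) for the three scales.

    Follows the Figure 4 caption: sizes (h, w)/c, (h, w)/2c, (h, w)/4c with
    c, 2c, and 4c crops per image. Integer division is an assumption.
    """
    return [(h // (s * c), w // (s * c), s * c) for s in (1, 2, 4)]


if __name__ == "__main__":
    key = momentum_update(query_params=[1.0, 2.0], key_params=[0.0, 0.0], m=0.9)
    print(key)
    print(crop_schedule(512, 512, 4))
```

With a coefficient close to 1, the key encoder changes slowly, so the keys it produces stay consistent across training steps while the query encoder learns.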