Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning

Chengchao Shen; Jianzhong Chen; Jianxin Wang

TL;DR

This paper addresses the limitation of single-grained contrastive learning by introducing Multi-Grained Contrast (MGC), which learns image representations across multiple granularities, from patch to image. It constructs delicate, overlap-based correspondences at granularity levels $c \in \{1,2,7,14\}$ and optimizes a set of cross-entropy objectives to align representations from two views, using a ViT backbone with a momentum encoder and stop-gradient. Across object detection, instance segmentation, semantic segmentation, scene parsing, and keypoint detection, MGC achieves state-of-the-art or competitive results while exhibiting strong data efficiency, demonstrated by substantial gains on COCO, ADE20K, VOC, and Cityscapes without requiring massive pretraining. The approach enhances transferability by capturing both global and fine-grained local patterns, and ablations and visualizations support its localization capabilities and its practical value for diverse vision tasks.

Abstract

Existing contrastive learning methods mainly focus on single-grained representation learning, e.g., part-level, object-level, or scene-level representations, and thus inevitably neglect the transferability of representations to other granularity levels. In this paper, we aim to learn multi-grained representations that effectively describe the image at various granularity levels, thus improving generalization on extensive downstream tasks. To this end, we propose a novel Multi-Grained Contrast method (MGC) for unsupervised representation learning. Specifically, we construct delicate multi-grained correspondences between positive views and then conduct multi-grained contrast using these correspondences to learn more general unsupervised representations. Without pretraining on a large-scale dataset, our method significantly outperforms existing state-of-the-art methods on extensive downstream tasks, including object detection, instance segmentation, scene parsing, semantic segmentation, and keypoint detection. Moreover, experimental results support the data-efficient property and excellent representation transferability of our method. The source code and trained weights are available at https://github.com/visresearch/mgc.
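
To make "multi-grained representations" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 14 × 14 grid of patch features and reads each granularity level c in the TL;DR as the side length (in patches) of a non-overlapping region, which matches the Figure 2 caption's description of 2 × 2 aggregation. Average pooling as the aggregation operator and the names `aggregate` and `multi_grained` are assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Grid = List[List[List[float]]]  # indexed as [row][col][channel]


def aggregate(patches: Grid, c: int) -> Grid:
    """Average non-overlapping c x c blocks of patch features.

    Assumption: mean pooling stands in for the aggregation step;
    the source text does not name the operator.
    """
    n = len(patches)
    if n % c != 0:
        raise ValueError("c must divide the grid side length")
    dim = len(patches[0][0])
    out: Grid = []
    for bi in range(0, n, c):
        row = []
        for bj in range(0, n, c):
            acc = [0.0] * dim
            for i in range(bi, bi + c):
                for j in range(bj, bj + c):
                    for k in range(dim):
                        acc[k] += patches[i][j][k]
            row.append([v / (c * c) for v in acc])
        out.append(row)
    return out


def multi_grained(patches: Grid, levels: Sequence[int] = (1, 2, 7, 14)) -> Dict[int, Grid]:
    """Return one pooled grid per granularity level."""
    return {c: aggregate(patches, c) for c in levels}


# Toy input: each patch feature is simply its (row, col) position.
toy = [[[float(i), float(j)] for j in range(14)] for i in range(14)]
grains = multi_grained(toy)
```

Under this reading, c = 1 leaves the patch grid unchanged and c = 14 collapses it to a single image-level vector, so the four levels span the patch-to-image range described in the TL;DR.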

Paper Structure (29 sections, 12 equations, 8 figures, 9 tables)

Introduction
Related Work
Image-Level Contrastive Learning
Region / Pixel-Level Contrastive Learning
Method
Multi-Grained Correspondences
Multi-Grained Contrast
Localization Analysis
Experiments
Experimental Settings
Datasets and Tasks
Network and Optimization
Experimental Results
Object Detection and Instance Segmentation
Semantic Segmentation
...and 14 more sections

Figures (8)

Figure 1: The image granularities that various downstream tasks focus on. For better transferability, general representations are required to cover a wide range of image semantic granularities.
Figure 2: Multi-grained correspondences. Adjacent $2 \times 2$ regions of granularity 1 are aggregated into a larger granularity.
Figure 3: Overview of Multi-Grained Contrast. First, two image views are fed into a ViT backbone to obtain patch-wise representations. Then, the representations are aggregated into multi-grained ones and randomly sampled as a sparse sequence to reduce computation and memory cost. Finally, the sparse multi-grained representations are optimized against the delicate correspondence targets.
Figure 4: Localization analysis. The blue and red boxes mark patches from view 1 and view 2, respectively. In this case, the patch location in view 2 can be obtained from the one in view 1.
Figure 5: The effect of pretraining epochs on the object detection, instance segmentation, and keypoint detection tasks of the COCO dataset.
...and 3 more figures
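
The captions for Figures 2 to 4 describe correspondences between patches of two views and an objective defined on those targets. The sketch below illustrates one plausible reading and should not be taken as the paper's exact rule: each view is an axis-aligned crop in source-image coordinates, a patch in view 1 is matched to the view 2 patch with the largest overlap area, and a softmax cross-entropy scores one row of similarities against that target. The function names, the max-overlap rule, and the temperature value are all assumptions; flips, the momentum encoder, and stop-gradient are not modeled.

```python
import math
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in source-image coordinates


def patch_boxes(crop: Box, grid: int) -> List[Box]:
    """Split a crop into grid x grid patch boxes, in row-major order."""
    x0, y0, x1, y1 = crop
    w, h = (x1 - x0) / grid, (y1 - y0) / grid
    return [(x0 + j * w, y0 + i * h, x0 + (j + 1) * w, y0 + (i + 1) * h)
            for i in range(grid) for j in range(grid)]


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def correspondence_targets(crop1: Box, crop2: Box, grid: int) -> List[Optional[int]]:
    """For each view 1 patch, the index of the max-overlap view 2 patch, or None."""
    boxes2 = patch_boxes(crop2, grid)
    targets: List[Optional[int]] = []
    for b1 in patch_boxes(crop1, grid):
        areas = [overlap_area(b1, b2) for b2 in boxes2]
        best = max(range(len(areas)), key=areas.__getitem__)
        targets.append(best if areas[best] > 0 else None)
    return targets


def cross_entropy(sim_row: List[float], target: int, temperature: float = 0.2) -> float:
    """Softmax cross-entropy of one similarity row against a target index.

    The default temperature is an illustrative assumption.
    """
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


# Two 4 x 4 crops that share only the square from (2, 2) to (4, 4).
targets = correspondence_targets((0, 0, 4, 4), (2, 2, 6, 6), grid=2)
```

In this toy case only the bottom-right patch of view 1 overlaps view 2, so it is the only patch that receives a target. Patches without a target would simply be skipped when summing the loss in this sketch.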
