Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning
Chengchao Shen, Jianzhong Chen, Jianxin Wang
TL;DR
This paper addresses the limitation of single-grained contrastive learning by introducing Multi-Grained Contrast (MGC), which learns image representations across multiple granularities from patch to image. It constructs delicate, overlap-based correspondences at granularity levels $c \in \{1,2,7,14\}$ and optimizes a set of cross-entropy objectives to align representations from two views using a ViT backbone with a momentum encoder and stop-gradient. Across object detection, instance segmentation, semantic segmentation, scene parsing, and keypoint detection, MGC achieves state-of-the-art or competitive results while exhibiting strong data efficiency, demonstrated by substantial gains on COCO, ADE20K, VOC, and Cityscapes without requiring massive pretraining. The approach enhances transferability by capturing both global and fine-grained local patterns, with ablations and visualizations supporting its localization capabilities and practical impact for diverse vision tasks.
Abstract
Existing contrastive learning methods mainly focus on single-grained representation learning, e.g., at the part, object, or scene level, and thus inevitably neglect the transferability of representations to other granularity levels. In this paper, we aim to learn multi-grained representations, which can effectively describe an image at various granularity levels and thus improve generalization across a wide range of downstream tasks. To this end, we propose a novel Multi-Grained Contrast method (MGC) for unsupervised representation learning. Specifically, we construct delicate multi-grained correspondences between positive views and then conduct multi-grained contrast based on these correspondences to learn more general unsupervised representations. Without pretraining on a large-scale dataset, our method significantly outperforms existing state-of-the-art methods on a wide range of downstream tasks, including object detection, instance segmentation, scene parsing, semantic segmentation, and keypoint detection. Moreover, experimental results support the data efficiency and excellent representation transferability of our method. The source code and trained weights are available at https://github.com/visresearch/mgc.
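To make the idea of multi-grained correspondence and contrast more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each view is an axis-aligned crop of the same image, that a granularity level $c$ splits a crop into a $c \times c$ grid of regions, and that a region's positive is the region in the other view with the largest overlap area. The function names (`grid_boxes`, `overlap_area`, `correspondences`, `info_nce`) and the temperature `tau=0.2` are hypothetical choices made for illustration; the ViT backbone, momentum encoder, stop-gradient, and feature pooling are omitted.

```python
import math

def grid_boxes(crop, c):
    """Split a crop box (x0, y0, x1, y1) into a c x c grid of boxes, row-major."""
    x0, y0, x1, y1 = crop
    w, h = (x1 - x0) / c, (y1 - y0) / c
    return [(x0 + j * w, y0 + i * h, x0 + (j + 1) * w, y0 + (i + 1) * h)
            for i in range(c) for j in range(c)]

def overlap_area(a, b):
    """Intersection area of two boxes given in image coordinates."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def correspondences(crop_a, crop_b, c):
    """For each region of view A, index of the best-overlapping region of view B.

    Regions with no overlap at all get None (assumed here to be skipped in the loss).
    """
    boxes_b = grid_boxes(crop_b, c)
    matches = []
    for a in grid_boxes(crop_a, c):
        areas = [overlap_area(a, b) for b in boxes_b]
        best = max(range(len(areas)), key=areas.__getitem__)
        matches.append(best if areas[best] > 0 else None)
    return matches

def info_nce(queries, keys, matches, tau=0.2):
    """Mean cross-entropy over matched regions.

    queries, keys: lists of L2-normalised feature vectors, one per region.
    The matched key is the positive; all other keys act as negatives.
    """
    losses = []
    for q, m in zip(queries, matches):
        if m is None:
            continue
        logits = [sum(x * y for x, y in zip(q, k)) / tau for k in keys]
        mx = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
        losses.append(log_z - logits[m])
    return sum(losses) / len(losses) if losses else 0.0

# Two crops of the same image that share only one quadrant.
crop_a = (0.0, 0.0, 100.0, 100.0)
crop_b = (50.0, 50.0, 150.0, 150.0)
print(correspondences(crop_a, crop_b, 2))  # [None, None, None, 0]
```

Under these assumptions, a multi-grained objective would sum `info_nce` over the granularity levels, for example $c \in \{1,2,7,14\}$, with region features at each level obtained by pooling patch features; how the levels are weighted and how features are pooled is left open in this sketch.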
