UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

Bowen Shi, Peisen Zhao, Zichen Wang, Yuhang Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian, Xiaopeng Zhang

TL;DR

UMG-CLIP is a Unified Multi-Granularity learning framework that simultaneously empowers the model with versatile perception abilities across different levels of detail and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks.

Abstract

Vision-language foundation models, represented by Contrastive Language-Image Pre-training (CLIP), have gained increasing attention for jointly understanding visual and textual tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and their corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularity, encompassing image-level, region-level, and pixel-level captions and tags. Accordingly, we develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities across different levels of detail. With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We believe that UMG-CLIP represents a valuable advancement in vision-language foundation models. The code is available at https://github.com/lygsbw/UMG-CLIP.
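
To make the idea of multi-granularity alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes that alignment at each granularity can be expressed as a CLIP-style symmetric contrastive (InfoNCE) loss, and that the image-level and region-level terms are simply summed. The function names, the `temperature` default of 0.07, and the `region_weight` parameter are illustrative assumptions that do not come from the abstract.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(visual, text, temperature=0.07):
    """Symmetric contrastive loss where visual[i] is paired with text[i].

    The temperature value is an assumed default, not taken from the paper.
    """
    visual = [l2_normalize(v) for v in visual]
    text = [l2_normalize(t) for t in text]
    # logits[i][j]: cosine similarity of visual i and text j, temperature-scaled.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in text]
        for v in visual
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy with the matching pair (the diagonal) as target.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    text_to_visual = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(text_to_visual))


def multi_granularity_loss(image_feats, caption_feats,
                           region_feats, region_text_feats,
                           region_weight=1.0):
    """Hypothetical combination: image-level term plus weighted region-level term."""
    image_term = info_nce(image_feats, caption_feats)
    region_term = info_nce(region_feats, region_text_feats)
    return image_term + region_weight * region_term
```

In this sketch, correctly paired features yield a loss close to zero, while mismatched pairs are penalized heavily. The same loss function is reused at both granularities, so only the features passed in differ.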

Paper Structure

This paper contains 17 sections, 4 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Compared to existing models, UMG-CLIP demonstrates outstanding performance across a wide range of tasks.
  • Figure 2: The automated annotation workflow, which generates captions/tags at image-level, region-level, and pixel-level.
  • Figure 3: Visualization of an annotated image in UMG-41M. It includes tag and caption annotations at both the image level and the pixel level, as well as masks for both foreground and background.
  • Figure 4: UMG-CLIP is a multi-granularity multi-task framework that aligns both image-level and region-level visual features with their corresponding tags and captions. This alignment empowers the model with generalist capabilities across multiple granularities, allowing it to efficiently adapt to various downstream tasks through parameter-efficient tuning (PET).
  • Figure 5: The alignment between up-sampled visual tokens and texts ("A photo of [tag]") using EVA-CLIP-L/14 and UMG-CLIP-L/14, respectively. "[tag]" corresponds to the class labels of the main objects in the images.
  • ...and 2 more figures
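
The token-to-text alignment described in the Figure 5 caption can be illustrated with a small sketch. This is an assumption-laden illustration rather than the procedure used to produce the figure: it takes a grid of visual token embeddings and one text embedding (for a prompt such as "A photo of [tag]"), computes a cosine similarity per token, and enlarges the resulting map by nearest-neighbor repetition. The caption does not state the up-sampling method; nearest-neighbor is assumed here, in which case up-sampling the similarity map gives the same result as up-sampling the tokens first. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_map(token_grid, text_embedding):
    """Return a 2D map of cosine similarities, one value per visual token."""
    return [[cosine(tok, text_embedding) for tok in row] for row in token_grid]


def upsample_nearest(grid, factor):
    """Repeat every cell `factor` times along both axes (assumed method)."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Toy example: a 1x2 token grid with 2-dimensional embeddings.
tokens = [[[1.0, 0.0], [0.0, 1.0]]]
text = [1.0, 0.0]
heatmap = upsample_nearest(similarity_map(tokens, text), factor=2)
```

With real models, the token grid would come from the vision encoder's patch outputs and the text embedding from the text encoder; both are replaced by toy vectors here.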