ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction

Juan Yeo, Soonwoo Cha, Jiwoo Song, Hyunbin Jin, Taesup Kim

TL;DR

This work tackles the problem of enabling open-vocabulary dense prediction with CLIP-based models by focusing on two core deficits: semantic coherence of local features and fine-grained vision-language alignment at the patch level. It introduces Any-to-Any Self-Distillation (ATAS), a unified self-distillation framework that simultaneously enhances patch-level alignment and preserves semantic coherence through Global-to-Local Distillation, Local-to-Local Distillation, and Global-to-Global Distillation. ATAS achieves substantial improvements on open-vocabulary semantic segmentation and object detection benchmarks, including zero-shot and fine-tuning setups, by using only unlabeled images and mosaic augmentation to guide cross-level knowledge transfer. The results demonstrate that maintaining both coherence and alignment is crucial for dense prediction, with ATAS consistently outperforming baseline CLIP and prior adaptation methods across diverse datasets and tasks, highlighting its practical potential for open-vocabulary vision systems.

Abstract

Vision-language models such as CLIP have recently propelled open-vocabulary dense prediction tasks by enabling recognition of a broad range of visual concepts. However, CLIP still struggles with fine-grained, region-level understanding, hindering its effectiveness on these dense prediction tasks. We identify two pivotal factors required to address this limitation: semantic coherence and fine-grained vision-language alignment. Current adaptation methods often improve fine-grained alignment at the expense of semantic coherence, and frequently rely on extra modules or supervised fine-tuning. To overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel approach that simultaneously enhances semantic coherence and fine-grained alignment by leveraging a model's own knowledge across all representation levels. Unlike prior methods, ATAS uses only unlabeled images and an internal self-distillation process to refine the representations of CLIP vision encoders, preserving local semantic consistency while sharpening local detail recognition. On open-vocabulary object detection and semantic segmentation benchmarks, ATAS achieves substantial performance gains, outperforming baseline CLIP models. These results validate the effectiveness of our approach and underscore the importance of jointly maintaining semantic coherence and fine-grained alignment for advanced open-vocabulary dense prediction.
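
The abstract does not spell out the ATAS loss functions, so the following is only an illustrative sketch of what a global-to-local self-distillation term could look like, not the authors' implementation. It assumes (hypothetically) that the patch tokens falling inside each mosaic cell are mean-pooled and pulled toward the frozen teacher's CLS embedding of the corresponding source image with a cosine objective. The function names, the pooling choice, and the `1 - cosine` form are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def global_to_local_loss(teacher_cls, student_patches_per_cell):
    """Hypothetical global-to-local term (an assumption, not the paper's formula).

    teacher_cls: one CLS embedding per source image placed in the mosaic.
    student_patches_per_cell: for each source image, the student's patch
        embeddings that fall inside that image's mosaic cell.
    Returns the mean of (1 - cosine) between each pooled cell and its CLS target.
    """
    if len(teacher_cls) != len(student_patches_per_cell):
        raise ValueError("need one group of patch embeddings per teacher CLS embedding")
    total = 0.0
    for cls_vec, patches in zip(teacher_cls, student_patches_per_cell):
        total += 1.0 - cosine(cls_vec, mean_pool(patches))
    return total / len(teacher_cls)


# Toy mosaic of two source images with 2-dimensional embeddings.
toy_teacher_cls = [[1.0, 0.0], [0.0, 1.0]]
toy_student_patches = [
    [[1.0, 0.0], [3.0, 0.0]],  # cell 0: already aligned with its CLS target
    [[1.0, 0.0], [1.0, 0.0]],  # cell 1: orthogonal to its CLS target
]
toy_loss = global_to_local_loss(toy_teacher_cls, toy_student_patches)
```

In this toy case the first cell contributes no loss and the second contributes the full penalty of an orthogonal embedding, so a training step under this assumed objective would move only the second cell's patch features.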

Paper Structure

This paper contains 30 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overall results of our proposed method. (a) Patch similarity maps illustrate the cosine similarity between a specific patch (marked with a red '$\times$') and other patches, revealing semantic coherence. (b) Downstream dense prediction task performances present relative gains (%) compared to original CLIP, averaged across datasets.
  • Figure 3: Overview of ATAS framework. ATAS is built upon three core components: (1) Global-to-Local Distillation (2) Global-to-Global Distillation and (3) Local-to-Local Distillation. We utilize patch token embeddings from mosaic-augmented images as local representations and CLS token embeddings from individual images as global representations.
  • Figure 4: Qualitative results on zero-shot semantic segmentation. We use images from PASCAL VOC and show their segmentations for CLIP, CLIPSelf and ATAS.
  • Figure : (a) Semantic Coherence
  • Figure : (a) Semantic Coherence
  • ...and 1 more figure
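
The patch similarity maps described in the Figure 1 caption can be sketched as follows. This is a minimal, dependency-free illustration: it assumes the patch embeddings are already available as a row-major list over a `grid_h` by `grid_w` patch grid, and the function and variable names are hypothetical rather than taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def patch_similarity_map(patch_embeddings, grid_h, grid_w, query_row, query_col):
    """Return a grid_h x grid_w map of cosine similarities to one query patch.

    patch_embeddings: row-major list of grid_h * grid_w embedding vectors.
    (query_row, query_col): grid position of the reference patch, i.e. the
        patch marked with a cross in a figure such as Figure 1(a).
    """
    if len(patch_embeddings) != grid_h * grid_w:
        raise ValueError("expected grid_h * grid_w patch embeddings")
    query = patch_embeddings[query_row * grid_w + query_col]
    return [
        [
            cosine_similarity(query, patch_embeddings[row * grid_w + col])
            for col in range(grid_w)
        ]
        for row in range(grid_h)
    ]


# Toy 2x2 patch grid with 2-dimensional embeddings; the query is the top-left patch.
toy_patches = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
sim_map = patch_similarity_map(toy_patches, grid_h=2, grid_w=2, query_row=0, query_col=0)
```

A semantically coherent encoder would be expected to give high values for patches that belong to the same object as the query patch and low values elsewhere, which is the property the similarity maps are meant to reveal.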