ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction
Juan Yeo, Soonwoo Cha, Jiwoo Song, Hyunbin Jin, Taesup Kim
TL;DR
This work tackles the problem of enabling open-vocabulary dense prediction with CLIP-based models by focusing on two core deficits: semantic coherence of local features and fine-grained vision-language alignment at the patch level. It introduces Any-to-Any Self-Distillation (ATAS), a unified self-distillation framework that simultaneously enhances patch-level alignment and preserves semantic coherence through Global-to-Local Distillation, Local-to-Local Distillation, and Global-to-Global Distillation. ATAS achieves substantial improvements on open-vocabulary semantic segmentation and object detection benchmarks, including zero-shot and fine-tuning setups, by using only unlabeled images and mosaic augmentation to guide cross-level knowledge transfer. The results demonstrate that maintaining both coherence and alignment is crucial for dense prediction, with ATAS consistently outperforming baseline CLIP and prior adaptation methods across diverse datasets and tasks, highlighting its practical potential for open-vocabulary vision systems.
Abstract
Vision-language models such as CLIP have recently propelled open-vocabulary dense prediction tasks by enabling recognition of a broad range of visual concepts. However, CLIP still struggles with fine-grained, region-level understanding, which hinders its effectiveness on these dense prediction tasks. We identify two pivotal factors required to address this limitation: semantic coherence and fine-grained vision-language alignment. Current adaptation methods often improve fine-grained alignment at the expense of semantic coherence, and many rely on extra modules or supervised fine-tuning. To overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel approach that simultaneously enhances semantic coherence and fine-grained alignment by leveraging a model's own knowledge across all representation levels. Unlike prior methods, ATAS uses only unlabeled images and an internal self-distillation process to refine the representations of CLIP vision encoders, preserving local semantic consistency while sharpening local detail recognition. On open-vocabulary object detection and semantic segmentation benchmarks, ATAS achieves substantial performance gains, outperforming baseline CLIP models. These results validate the effectiveness of our approach and underscore the importance of jointly maintaining semantic coherence and fine-grained alignment for advanced open-vocabulary dense prediction.
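To make the idea of distilling a model's own knowledge across representation levels more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a frozen teacher and a trainable student that each produce one global embedding and a set of patch embeddings, and it assumes a simple cosine-distance objective for each of the three terms named in the TL;DR (global-to-local, local-to-local, global-to-global). The function names, the loss form, and the weights are hypothetical; the mosaic augmentation step is omitted.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def cosine_distill_loss(student, teacher):
    """Mean of (1 - cosine similarity) over matching rows.

    Assumed loss form for illustration; the source does not specify it.
    """
    s = l2_normalize(student)
    t = l2_normalize(teacher)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))


def any_to_any_loss(student_global, student_patches,
                    teacher_global, teacher_patches,
                    weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of three cross-level distillation terms.

    student_global, teacher_global: arrays of shape (D,)
    student_patches, teacher_patches: arrays of shape (N, D)
    """
    # Global-to-local: pull every student patch toward the teacher's
    # global embedding (assumed form of patch-level alignment).
    target = np.broadcast_to(teacher_global, student_patches.shape)
    g2l = cosine_distill_loss(student_patches, target)

    # Local-to-local: keep student patches close to the teacher's patches
    # (assumed form of preserving semantic coherence).
    l2l = cosine_distill_loss(student_patches, teacher_patches)

    # Global-to-global: keep the student's global embedding close to
    # the teacher's global embedding.
    g2g = cosine_distill_loss(student_global[None, :], teacher_global[None, :])

    w_g2l, w_l2l, w_g2g = weights
    return w_g2l * g2l + w_l2l * l2l + w_g2g * g2g


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t_global = rng.normal(size=8)
    t_patches = rng.normal(size=(4, 8))
    # A student that is a noisy copy of the teacher.
    s_global = t_global + 0.1 * rng.normal(size=8)
    s_patches = t_patches + 0.1 * rng.normal(size=(4, 8))
    print(any_to_any_loss(s_global, s_patches, t_global, t_patches))
```

In this sketch the three terms are simply summed with scalar weights; how the actual method balances or constructs them is not stated in the abstract, so the weighting here is only a placeholder.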
