Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition
Haijing Liu, Tao Pu, Hefeng Wu, Keze Wang, Liang Lin
TL;DR
This work tackles open-vocabulary multi-label recognition (OV-MLR) by introducing C$^2$SRT, a category-adaptive framework that jointly refines intra-category semantics and transfers inter-category knowledge. It builds on a CLIP-based vision-language backbone with a distillation loss to preserve generalization. Its intra-category semantic refinement (ISR) module adaptively selects semantically relevant local patches guided by category text, and its inter-category semantic transfer (IST) module uses LLM-driven category graphs and GATv2 to propagate knowledge from seen to unseen labels. The approach yields consistent improvements over prior OV-MLR methods on NUS-WIDE and Open Images in both zero-shot (ZSL) and generalized zero-shot (GZSL) settings, and ablations confirm the complementary benefits of distillation, ISR, and IST. The combination of cross-modal intra- and inter-category refinement offers a practical and scalable path for robust open-vocabulary recognition in multi-label scenarios.
Abstract
Benefiting from the generalization capability of CLIP, recent vision-language pre-training (VLP) models have demonstrated an impressive ability to capture virtually any visual concept in daily images. However, due to the presence of unseen categories in open-vocabulary settings, existing algorithms struggle to capture strong semantic correlations between categories, resulting in sub-optimal performance on open-vocabulary multi-label recognition (OV-MLR). Furthermore, the number of discriminative regions varies substantially across object categories, which is at odds with the fixed-number patch matching used in current methods and introduces noisy visual cues that hinder the accurate capture of target semantics. To tackle these challenges, we propose a novel category-adaptive cross-modal semantic refinement and transfer (C$^2$SRT) framework that explores semantic correlations both within each category and across different categories in a category-adaptive manner. The proposed framework consists of two complementary modules: an intra-category semantic refinement (ISR) module and an inter-category semantic transfer (IST) module. Specifically, the ISR module leverages the cross-modal knowledge of the VLP model to adaptively find a set of local discriminative regions that best represent the semantics of the target category. The IST module adaptively discovers the most correlated categories for a target category by using the commonsense capabilities of LLMs to construct a category-adaptive correlation graph, and it transfers semantic knowledge from the correlated seen categories to unseen ones. Extensive experiments on OV-MLR benchmarks demonstrate that the proposed C$^2$SRT framework outperforms current state-of-the-art algorithms.
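To make the contrast between fixed-number patch matching and category-adaptive selection concrete, the following is a minimal sketch, not the ISR module itself. It assumes that patch and category-text embeddings live in a shared space and that a patch is kept when its cosine similarity to the category text is at least a fraction `ratio` of the best-matching patch's similarity. The function names, the `ratio` rule, and the toy 2-D vectors are all hypothetical and chosen only for illustration; the abstract does not specify the actual selection criterion.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_patches(patch_embs, text_emb, ratio=0.8):
    """Hypothetical adaptive rule: keep every patch whose similarity to the
    category text is at least `ratio` times the best patch's similarity.
    The number of kept patches therefore varies from category to category."""
    sims = [cosine(p, text_emb) for p in patch_embs]
    best = max(sims)
    if best <= 0.0:
        return []  # no patch is positively aligned with this category
    return [i for i, s in enumerate(sims) if s >= ratio * best]


def category_score(patch_embs, text_emb, ratio=0.8):
    """Average similarity over the adaptively selected patches only."""
    kept = select_patches(patch_embs, text_emb, ratio)
    if not kept:
        return 0.0
    return sum(cosine(patch_embs[i], text_emb) for i in kept) / len(kept)


# Toy example: five patch embeddings and two category-text embeddings.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.3, 0.9], [0.2, 1.0]]
text_a = [1.0, 0.0]  # a category supported by two patches
text_b = [0.0, 1.0]  # a category supported by three patches

print(select_patches(patches, text_a))
print(select_patches(patches, text_b))
```

In this toy setup the two categories end up with different numbers of selected patches, whereas a fixed top-k rule would force the same count on both and either drop a relevant patch or admit an irrelevant one.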
