Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition

Haijing Liu, Tao Pu, Hefeng Wu, Keze Wang, Liang Lin

TL;DR

This work tackles open-vocabulary multi-label recognition (OV-MLR) by introducing C$^2$SRT, a category-adaptive framework that jointly refines intra-category semantics and transfers inter-category knowledge. It builds on a CLIP-based vision-language backbone with a distillation loss to preserve generalization, employs an intra-category semantic refinement (ISR) module to adaptively select semantically relevant local patches guided by category text, and uses an inter-category semantic transfer (IST) module with LLM-driven category graphs and GATv2 to propagate knowledge from seen to unseen labels. The approach yields consistent improvements over prior OV-MLR methods on NUS-WIDE and Open Images in both zero-shot learning (ZSL) and generalized zero-shot learning (GZSL) settings, and ablations confirm the complementary benefits of distillation, ISR, and IST. Combining cross-modal intra-category refinement with inter-category transfer offers a practical path to robust open-vocabulary recognition in multi-label scenarios.

Abstract

Benefiting from the generalization capability of CLIP, recent vision-language pre-training (VLP) models have demonstrated an impressive ability to capture virtually any visual concept in daily images. However, because open-vocabulary settings include unseen categories, existing algorithms struggle to capture the strong semantic correlations between categories, resulting in sub-optimal performance on open-vocabulary multi-label recognition (OV-MLR). Furthermore, the number of discriminative areas varies substantially across object categories, which is misaligned with the fixed-number patch matching used in current methods and introduces noisy visual cues that hinder the accurate capture of target semantics. To tackle these challenges, we propose a novel category-adaptive cross-modal semantic refinement and transfer (C$^2$SRT) framework that explores the semantic correlation both within each category and across different categories in a category-adaptive manner. The proposed framework consists of two complementary modules: an intra-category semantic refinement (ISR) module and an inter-category semantic transfer (IST) module. Specifically, the ISR module leverages the cross-modal knowledge of the VLP model to adaptively find a set of local discriminative regions that best represent the semantics of the target category. The IST module adaptively discovers the set of most correlated categories for a target category by using the commonsense capabilities of LLMs to construct a category-adaptive correlation graph, and transfers semantic knowledge from the correlated seen categories to unseen ones. Extensive experiments on OV-MLR benchmarks demonstrate that the proposed C$^2$SRT framework outperforms current state-of-the-art algorithms.
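
The inter-category transfer idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' IST module: the relatedness scores below are hypothetical stand-ins for LLM-derived correlations, the function names `top_k_seen_neighbors` and `transfer` and the `mix` parameter are invented for illustration, and a plain softmax-weighted average is swapped in for the GATv2 layer named in the TL;DR.

```python
import math


def top_k_seen_neighbors(relatedness, seen, k):
    """For every category, keep the k most related *seen* categories.

    relatedness: dict category -> dict other_category -> score
    (hypothetical scores; the paper derives relationships with an LLM).
    """
    graph = {}
    for cat, scores in relatedness.items():
        candidates = [c for c in scores if c in seen and c != cat]
        candidates.sort(key=lambda c: scores[c], reverse=True)
        graph[cat] = candidates[:k]
    return graph


def transfer(features, graph, relatedness, mix=0.5):
    """One propagation step: blend each category feature with a
    softmax-weighted average of its neighbours' features.

    This is a simplification, not a GATv2 layer: the weights come
    directly from the relatedness scores instead of learned attention.
    """
    updated = {}
    for cat, neighbors in graph.items():
        own = features[cat]
        if not neighbors:
            updated[cat] = list(own)
            continue
        scores = [relatedness[cat][n] for n in neighbors]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        agg = [0.0] * len(own)
        for e, n in zip(exps, neighbors):
            for d, value in enumerate(features[n]):
                agg[d] += (e / norm) * value
        updated[cat] = [(1 - mix) * o + mix * a for o, a in zip(own, agg)]
    return updated


# Toy example: "wolf" is unseen and starts with an uninformative feature.
seen = {"dog", "cat", "car"}
relatedness = {
    "dog": {"cat": 0.9, "car": 0.1},
    "cat": {"dog": 0.9, "car": 0.2},
    "car": {"dog": 0.1, "cat": 0.2},
    "wolf": {"dog": 0.95, "cat": 0.6, "car": 0.05},
}
features = {
    "dog": [1.0, 0.0],
    "cat": [0.8, 0.2],
    "car": [0.0, 1.0],
    "wolf": [0.0, 0.0],
}
graph = top_k_seen_neighbors(relatedness, seen, k=2)
updated = transfer(features, graph, relatedness)
```

In this toy setup the unseen category borrows most of its updated representation from its two most related seen categories, which is the behaviour the abstract attributes to the correlation graph. The neighbour count `k` plays the role of the "number of related categories" studied in Figure 4.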

Paper Structure

This paper contains 27 sections, 23 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Architectural differences between (a) traditional multi-label recognition methods and (b) open-vocabulary multi-label recognition methods. Compared with previous approaches, (c) our proposed method explores rich semantic correlation both within each category and across different categories.
  • Figure 2: Several examples of semantic correlations (a) across different categories and (b) within each category.
  • Figure 3: The overall C$^2$SRT framework. C$^2$SRT employs a learnable vision encoder, which aligns its features through knowledge distillation from a fixed VLP vision encoder, to extract image features. Simultaneously, a fixed VLP text encoder extracts ensemble-based textual features. The ISR module quantifies information by calculating the intra-category semantic similarity of local patch features, selects the most informative patches, and adaptively focuses on local visual features using a threshold based on the total information. After the visual and textual features are fused, the multi-modal features are fed into the IST module, which enables adaptive inter-category knowledge transfer, with inter-category relationships derived from LLM-driven relationship mining.
  • Figure 4: Effect of varying numbers of related categories in the IST module for (a) zero-shot learning (ZSL) and (b) generalized zero-shot learning (GZSL) tasks on the NUS-WIDE dataset.
  • Figure 5: Effect of hyper-parameter $\alpha$ in the ISR module for (a) zero-shot learning (ZSL) and (b) generalized zero-shot learning (GZSL) tasks on the NUS-WIDE dataset. Note that $\alpha=0.0$ indicates the absence of local features.
  • ...and 6 more figures
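
The Figure 3 and Figure 5 captions describe category-adaptive patch selection with a threshold on the total information. The sketch below shows one possible reading of that idea, not the paper's actual formulation: it assumes that "information" is a softmax over cosine similarities between patch features and a category text feature, and that patches are kept in descending order until their cumulative weight reaches $\alpha$. The function names and the `temperature` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_patches(patch_feats, text_feat, alpha, temperature=0.1):
    """Return indices of the smallest set of patches whose normalised
    similarity mass reaches alpha (assumed reading of the threshold).

    alpha <= 0 selects no patches, matching the caption's note that
    alpha = 0.0 means local features are absent.
    """
    if alpha <= 0.0:
        return []
    sims = [cosine(p, text_feat) for p in patch_feats]
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += weights[i]
        if mass >= alpha:
            break
    return selected


# Toy patch features and two category text features (all hypothetical).
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
category_a = [1.0, 0.0]
category_b = [0.0, 1.0]

picked_a = select_patches(patches, category_a, alpha=0.9)
picked_b = select_patches(patches, category_b, alpha=0.9)
```

With the same $\alpha$, the two toy categories end up with different numbers of selected patches, which is the contrast with fixed-number patch matching that the abstract draws.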