Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport

Hao Tan, Zichang Tan, Jun Li, Ajian Liu, Jun Wan, Zhen Lei

TL;DR

Open-vocabulary multi-label recognition (OVMLR) is hindered by degraded local semantics in CLIP-based encoders and by weak region-to-label matching. RAM addresses both issues: Ladder Local Adapter (LLA) restores local context, and Knowledge-Constrained Optimal Transport (KCOT) matches image regions to labels jointly under constraints supplied by Label Presence Detection and Teacher Knowledge Transfer. The framework uses a learnable label prompt set and a contrastive Multi-Matching (MMC) loss, and achieves state-of-the-art results on six open-vocabulary benchmarks spanning natural, pedestrian, and remote-sensing domains. RAM generalizes well to unseen labels at low memory and inference cost, which underscores the practical value of enforcing locality and principled set matching in OVMLR.

Abstract

Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce refocusing on local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains, and shows great potential to boost the existing methods. Code: https://github.com/EricTan7/RAM.
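
The contrast the abstract draws between naive aggregation and an optimal transport formulation can be illustrated with a small sketch. This is not the authors' KCOT implementation: it is a generic entropy-regularized Sinkhorn solver in plain Python, and the toy similarities, the cost `1 - similarity`, the uniform label marginal, the `region_weights` values, and all function names are assumptions made for illustration. Averaging region-level similarities stands in here for average pooling.

```python
import math

def sinkhorn(cost, row_marginal, col_marginal, eps=0.1, n_iters=500):
    """Entropy-regularized optimal transport; returns the transport plan."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel K = exp(-cost / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so each region ships out its prescribed mass.
        u = [row_marginal[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so each label receives its prescribed mass.
        v = [col_marginal[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def average_pool_scores(sim):
    """Baseline: every region contributes equally to every label."""
    n, m = len(sim), len(sim[0])
    return [sum(sim[i][j] for i in range(n)) / n for j in range(m)]

def transport_scores(sim, plan, col_marginal):
    """Per-label similarity, weighted by how much mass each region sends to it."""
    n, m = len(sim), len(sim[0])
    return [sum(plan[i][j] * sim[i][j] for i in range(n)) / col_marginal[j]
            for j in range(m)]

# Made-up cosine similarities for 4 regions x 2 labels.
# Regions 0 and 1 show label 0, region 3 shows label 1, region 2 is background.
sim = [[0.9, 0.1],
       [0.8, 0.2],
       [0.1, 0.1],
       [0.1, 0.7]]
cost = [[1.0 - s for s in row] for row in sim]
label_marginal = [0.5, 0.5]  # assumption: labels share the mass equally

# Uniform mass per region.
plan_uniform = sinkhorn(cost, [0.25] * 4, label_marginal)
# Hypothetical non-uniform region weights that down-weight the background region.
region_weights = [0.25, 0.25, 0.05, 0.45]
plan_weighted = sinkhorn(cost, region_weights, label_marginal)

pooled = average_pool_scores(sim)
ot_uniform = transport_scores(sim, plan_uniform, label_marginal)
ot_weighted = transport_scores(sim, plan_weighted, label_marginal)

print("average pooling :", [round(s, 3) for s in pooled])
print("OT, uniform     :", [round(s, 3) for s in ot_uniform])
print("OT, weighted    :", [round(s, 3) for s in ot_weighted])
```

In this toy setting the transport plan routes most of each region's mass to the label it resembles, so irrelevant regions contribute little to a label's score, whereas averaging lets every region dilute every label. Changing the region marginal is one simple way to express a constraint on the visual set; how RAM actually derives such constraints is described in the paper, not here.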

Paper Structure

This paper contains 34 sections, 22 equations, 14 figures, 12 tables, 1 algorithm.

Figures (14)

  • Figure 1: Visualizations and performance comparisons for the proposed LLA and KCOT. (a) Effectiveness of LLA: CLIP localizes poorly, so optimal transport applied to its features gives poor results. LLA recovers locality and yields precise matching. (b) Effectiveness of KCOT: we compare the matching produced by KCOT with the re-weighting scheme of DualCoOp (Sun et al., 2022) across multiple labels. Ground-truth (GT) labels are marked with red boxes. Re-weighting produces noisy matching, while KCOT focuses precisely on GT labels. (c) Performance: both LLA and KCOT bring notable improvements across different datasets.
  • Figure 2: Overview of the proposed RAM framework. LLA recovers the local semantics of the image encoder. KCOT is applied between local image features and text features to find a region-to-label matching. Within KCOT, Label Presence Detection (LPD) imposes an unbalanced constraint on the visual set, implicitly highlighting foreground areas. Teacher Knowledge Transfer (TKT) encourages knowledge alignment and is performed only during training. RAM is trained with the contrastive objective $\mathcal{L}_{MMC}$. The global feature is omitted for clarity.
  • Figure 3: The proposed Ladder Local Adapter (LLA). There are no connections back to the original image encoder (i.e., a ladder side structure), so gradients propagate only within LLA, enabling efficient transfer. The outputs from SAA and TSS are averaged. LLA is applied only in the last few layers.
  • Figure 4: Comparison of attention maps from the original self-attention (left) and our SAA (right). The original attention is overwhelmed by dominant patches, while SAA produces diagonal-style attention maps (see the appendix discussion of SAA for more details).
  • Figure 5: Illustration of the KCOT process. LPD imposes weight constraints on image regions. TKT simply modifies the cost.
  • ...and 9 more figures