Multi-Label Test-Time Adaptation with Bound Entropy Minimization

Xiangyu Wu, Feng Yu, Qing-Guo Chen, Yang Yang, Jianfeng Lu

TL;DR

This work tackles test-time adaptation for multi-label image understanding by introducing Bound Entropy Minimization (BEM), which raises the probabilities of the top-k predicted labels rather than only the top-1. By treating paired captions as strong-label pseudo-views and aligning view and caption prompts within CLIP's multimodal space, the Multi-Label Test-Time Adaptation (ML-TTA) framework binds the top-k predictions of both views and captions and optimizes them simultaneously. Evaluated on MSCOCO, VOC, and NUSWIDE across multiple CLIP backbones and prompt initializations, the approach consistently outperforms the baseline CLIP model and existing SOTA TTA methods in multi-label settings, with ablations highlighting the importance of BEM and caption-based guidance. The method offers a practical adaptation route for real-world multi-label tasks, using textual descriptions to guide visual adaptation without requiring source-domain data during testing.

Abstract

Mainstream test-time adaptation (TTA) techniques endeavor to mitigate distribution shifts via entropy minimization for multi-class classification, inherently increasing the probability of the most confident class. However, when encountering multi-label instances, the primary challenge stems from the varying number of labels per image, and prioritizing only the highest-probability class inevitably undermines the adaptation of other positive labels. To address this issue, we investigate TTA within the multi-label scenario (ML-TTA), developing a Bound Entropy Minimization (BEM) objective to simultaneously increase the confidence of multiple top predicted labels. Specifically, to determine the number of labels for each augmented view, we retrieve a paired caption for that view, which yields textual labels. These labels are allocated to both the view and the caption, forming a weak label set and a strong label set, respectively, of the same size k. Following this, the proposed BEM treats the top-k predicted labels of the view and of the caption each as a single entity, learning both view and caption prompts concurrently. By binding the top-k predicted labels, BEM overcomes the limitation of vanilla entropy minimization, which exclusively optimizes the most confident class. Across the MSCOCO, VOC, and NUSWIDE multi-label datasets, our ML-TTA framework equipped with BEM exhibits superior performance compared to the latest SOTA methods, across various model architectures, prompt initializations, and varying label scenarios. The code is available at https://github.com/Jinx630/ML-TTA.
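To make the contrast between vanilla entropy minimization and a "bound" objective concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation (see the linked repository for that). It assumes softmax probabilities over label logits, uses the analytic gradient of the softmax entropy, and models "binding" with a hypothetical helper `bind_top_k` that lifts each top-k logit to the top-1 value using offsets treated as constants (a stop-gradient). The logits and `k = 3` are made-up illustration values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy H = -sum_i p_i * log(p_i)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_grad(logits):
    """Analytic gradient of softmax entropy: dH/ds_j = -p_j * (log p_j + H)."""
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]


def bind_top_k(logits, k):
    """Hypothetical binding step (an assumption, not taken from the paper's code):
    lift each of the k largest logits to the top-1 value. The added offsets are
    constants, so the gradient w.r.t. the original logits equals the gradient
    w.r.t. the bound logits."""
    top = max(logits)
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    bound = list(logits)
    for i in order[:k]:
        bound[i] = top
    return bound


# Made-up logits: three plausible positive labels followed by two negatives.
logits = [4.0, 2.5, 2.0, -1.0, -1.5]
k = 3

# A gradient-descent step on H moves each logit along -dH/ds_j.
vanilla_step = [-g for g in entropy_grad(logits)]
bound_step = [-g for g in entropy_grad(bind_top_k(logits, k))]

for name, step in (("vanilla", vanilla_step), ("bound", bound_step)):
    print(name, [round(v, 4) for v in step])
```

In this toy setting the vanilla step is positive only for the top-1 logit and negative for the second and third labels, while the bound step is positive for all three top-k logits and negative for the rest. This matches the abstract's description of optimizing the top-k labels as a single entity.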

Paper Structure

This paper contains 22 sections, 2 theorems, 19 equations, 4 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Consider the output logits of a confident view $x$, denoted as $\mathbf{s} = (s_1, s_2, \dots, s_L)$, where, without loss of generality, we assume $s_1 > s_2 > \dots > s_L$. It can be deduced that the entropy loss $H = H(p(\cdot|x))$ decreases as $s_1$ increases, and $H$ increases as the sum of the
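
The first claim of Proposition 1, that $H$ decreases as $s_1$ increases, can be checked with a short derivation. This is a sketch rather than the paper's proof, and it assumes the standard definitions, which this excerpt does not spell out: $p_i = e^{s_i} / \sum_{j=1}^{L} e^{s_j}$ and $H = -\sum_{i=1}^{L} p_i \log p_i$.

$$
\frac{\partial p_i}{\partial s_j} = p_i(\delta_{ij} - p_j)
\quad\Longrightarrow\quad
\frac{\partial H}{\partial s_j}
= -\sum_{i=1}^{L} p_i(\delta_{ij} - p_j)(\log p_i + 1)
= -p_j\,(\log p_j + H).
$$

$$
\log p_1 + H = \sum_{i=1}^{L} p_i\,(\log p_1 - \log p_i) > 0
\quad\Longrightarrow\quad
\frac{\partial H}{\partial s_1} = -p_1\,(\log p_1 + H) < 0.
$$

The strict inequality uses $s_1 > s_i$, hence $p_1 > p_i$, for every $i > 1$ (with $L \ge 2$). For any other label $j$, the sign of $\partial H / \partial s_j$ depends on whether $\log p_j + H$ is positive or negative.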

Figures (4)

  • Figure 1: (a) Compared to CLIP, ML-TTA increases all positive label logits simultaneously, while other methods focus only on the top-1 class. (b) Comparison of various methods on images with varying numbers of labels. Compared to CLIP, as the number of labels per image rises, the adaptability of TPT and RLCF in handling multi-label images shows a marked decrease.
  • Figure 2: Overview of the proposed multi-label test-time adaptation.
  • Figure 3: Results on different number of views.
  • Figure 6: Comparison with binary cross-entropy loss.

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2