Determined Multi-Label Learning via Similarity-Based Prompt

Meng Wei, Zhongnian Li, Peng Ying, Yong Zhou, Xinzheng Xu

TL;DR

This work addresses the high cost of full multi-label annotation by introducing Determined Multi-Label Learning (DMLL), where each training example is paired with a binary determined label for a randomly chosen candidate class. It develops a risk-consistent estimator that reweights the loss using $p(y^{\gamma}=1|x)$ and $p(y^{\gamma}=0|x)$, enabling learning from determined-labeled data within a binary cross-entropy loss framework; when these probabilities are unknown, they are estimated via the RAM-based image model and sigmoid outputs. To further improve performance, the paper introduces a similarity-based prompt (SBP) mechanism that augments each target label with semantically similar labels drawn from the large RAM label space, using the CLIP text encoder to compute similarities and learning an optimal prompt $P^*$. Empirical results on VOC, COCO, NUS, and CUB show that DMLL consistently outperforms state-of-the-art weakly supervised approaches on four metrics (MAP, ranking loss, one error, and coverage), supporting both the theoretical guarantees and the practical value of SBP in large-scale vision-language settings.

Abstract

In multi-label classification, each training instance is associated with multiple class labels simultaneously. Unfortunately, collecting fully precise class labels for each training instance is time- and labor-consuming in real-world applications. To reduce this labeling cost, a novel labeling setting termed Determined Multi-Label Learning (DMLL) is proposed. In this setting, each training instance is associated with a determined label (either "Yes" or "No"), which indicates whether the training instance contains the provided class label. The provided class label is selected uniformly at random from the whole candidate label set. Moreover, each training instance needs to be determined only once, which significantly reduces the annotation cost for multi-label datasets. In this paper, we theoretically derive a risk-consistent estimator to learn a multi-label classifier from these determined-labeled training data. Additionally, we introduce a similarity-based prompt learning method for the first time, which minimizes the risk-consistent loss of large-scale pre-trained models to learn a supplemental prompt with richer semantic information. Extensive experimental validation underscores the efficacy of our approach, demonstrating superior performance compared to existing state-of-the-art methods.
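
The labeling protocol described in the abstract can be simulated in a few lines. The following is a minimal sketch, not the authors' code: the function names, the toy ground-truth sets, and the use of full labels to play the role of the annotator are all assumptions made for illustration.

```python
import random


def make_determined_label(true_labels, num_classes, rng):
    """Simulate one annotation query for a single training instance.

    true_labels: set of class indices actually present in the instance.
                 (Hypothetical: used here only to play the annotator.)
    Returns (gamma, answer): the queried class index and 1 for "Yes", 0 for "No".
    """
    # The queried class is drawn uniformly from the whole candidate label set.
    gamma = rng.randrange(num_classes)
    # The annotator answers whether the instance contains that class.
    answer = 1 if gamma in true_labels else 0
    return gamma, answer


def simulate_dataset(full_labels, num_classes, seed=0):
    """Produce exactly one determined label per instance."""
    rng = random.Random(seed)
    return [make_determined_label(labels, num_classes, rng) for labels in full_labels]


# Toy ground truth for four instances over five classes (illustrative only).
full_labels = [{0, 3}, {1}, set(), {2, 3, 4}]
determined = simulate_dataset(full_labels, num_classes=5)
for gamma, answer in determined:
    print(f"queried class {gamma}: {'Yes' if answer else 'No'}")
```

Each instance yields a single (class, answer) pair, which is the only supervision a DMLL learner would see in this setting.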

Paper Structure

This paper contains 21 sections, 11 equations, 4 figures, 4 tables, and 1 algorithm.

Figures (4)

  • Figure 1: An example of determined multi-label learning. a) An image with the "chair" label; b) an image without the "plane" label. Compared with precisely annotating all relevant labels, determining the presence of a randomly generated label in the image is easier and more time-efficient. This labeling procedure requires only a single assessment per image, making it suitable for labeling future real-world large-scale datasets.
  • Figure 2: The architecture of the proposed model. We design a similarity-based prompt (SBP) strategy to enhance the inherent semantic information of labels. First, we generate offline a list of similar labels for the target label set from the large label set organized by RAM. Second, these similar labels are embedded into the proposed SBP. Subsequently, we use the RAM image encoder and the CLIP text encoder to extract image and text features. Finally, we optimize the proposed risk-consistent loss based on the RAM model output and the linear classifier output, and iteratively update the similar labels list.
  • Figure 3: Comparison results with RAM tag list in terms of MAP (the greater, the better), one error (the smaller, the better), ranking loss (the smaller, the better), and coverage (the smaller, the better).
  • Figure 4: Comparison results with open vocabulary in terms of MAP (the greater, the better), one error (the smaller, the better), ranking loss (the smaller, the better), and coverage (the smaller, the better).
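
The offline similar-label step mentioned in the Figure 2 caption can be illustrated with a small retrieval sketch. This is an assumption-laden toy rather than the paper's procedure: the three-dimensional vectors stand in for CLIP text features, cosine similarity with a top-k cutoff is one plausible choice of similarity, and the label names and `top_k_similar` helper are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(target, pool, k):
    """Return the k labels in `pool` most similar to `target`, excluding itself."""
    target_vec = pool[target]
    scored = [
        (cosine(target_vec, vec), name)
        for name, vec in pool.items()
        if name != target
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy embeddings standing in for text-encoder features (assumed values).
pool = {
    "chair": [1.0, 0.1, 0.0],
    "sofa": [0.9, 0.2, 0.1],
    "stool": [0.8, 0.0, 0.2],
    "plane": [0.0, 1.0, 0.1],
    "jet": [0.1, 0.9, 0.0],
}

similar_to_chair = top_k_similar("chair", pool, k=2)
similar_to_plane = top_k_similar("plane", pool, k=1)
print(similar_to_chair, similar_to_plane)
```

In a real pipeline the vectors would come from a text encoder applied to the large label set, and the resulting list would be what gets embedded into the prompt.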