M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios

Ning Liao, Xiaopeng Zhang, Min Cao, Junchi Yan

TL;DR

This work tackles open-set recognition (OSR) in vision-language (VL) prompt learning by addressing label bias, in which inputs from unseen classes are forced into known training categories. It introduces M-Tuning, which expands the prompt text space with open words drawn from WordNet to simulate open-set scenarios and mitigate the bias, tuning the prompts while the VL backbone stays frozen. To scale to large datasets, it adds Combinatorial Tuning and Testing (CTT): the closed set is divided into smaller groups, a prompt is tuned per group, and the best group-specific prompt is selected at inference. The authors construct VL-based OSR baselines for fair comparison, report the best performance on both small and large datasets, and provide ablations for each component. Overall, M-Tuning with CTT targets both unknown detection and closed-set classification without requiring knowledge of the open-set class names.

Abstract

In realistic open-set scenarios, where the labels of part of the test data are entirely unknown, vision-language (VL) prompt learning methods that encounter inputs from unknown classes (i.e., classes not seen during training) always predict them as one of the training classes. This label bias hinders open-set recognition (OSR), in which an image should be correctly predicted either as one of the known classes or as unknown. To address this, we propose a vision-language prompt tuning method with mitigated label bias (M-Tuning). It introduces open words from WordNet to extend the vocabulary that forms the prompt texts beyond the closed-set label words, so that prompts are tuned in a simulated open-set scenario. In addition, motivated by the observation that classifying directly on large datasets yields a much higher false positive rate than on small datasets, we propose a Combinatorial Tuning and Testing (CTT) strategy to improve performance. CTT decomposes M-Tuning on large datasets into multiple independent group-wise tunings on fewer classes, then makes accurate and comprehensive predictions by selecting the optimal sub-prompt. Finally, given the lack of VL-based OSR baselines in the literature, especially for prompt methods, we contribute new baselines for fair comparison. Our method achieves the best performance on datasets of various scales, and extensive ablation studies validate its effectiveness.
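
The effect of open words on label bias can be illustrated with a small sketch. This is not the authors' implementation: the similarity scores are made-up numbers standing in for image-text similarities, the function names (`softmax`, `closed_set_max_prob`, `predict_open_set`) are hypothetical, and rejecting an input by thresholding the closed-set maximum probability is an assumed decision rule chosen only for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def closed_set_max_prob(closed_sims, open_sims=(), temperature=1.0):
    """Return (index, probability) of the most likely closed-set class.

    The softmax is taken jointly over closed-set classes and open words,
    so open words can absorb probability mass from the closed set.
    """
    probs = softmax(list(closed_sims) + list(open_sims), temperature)
    closed_probs = probs[:len(closed_sims)]
    best = max(range(len(closed_probs)), key=closed_probs.__getitem__)
    return best, closed_probs[best]

def predict_open_set(closed_sims, open_sims, threshold=0.5, temperature=1.0):
    """Assumed rule: return -1 (unknown) if the closed-set max is below threshold."""
    best, p = closed_set_max_prob(closed_sims, open_sims, temperature)
    return best if p >= threshold else -1

# Hypothetical similarities for an image from an unseen class.
unknown_closed = [2.0, 1.0, 0.5]   # scores against closed-set label words
unknown_open = [3.0, 2.5]          # scores against open words

# Closed-set words only: the image is forced into class 0 (label bias).
print(predict_open_set(unknown_closed, []))
# With open words: the closed-set maximum drops and the image is rejected.
print(predict_open_set(unknown_closed, unknown_open))
```

With only closed-set words, the softmax has to place all probability mass on known classes, so the unseen image still receives a confident closed-set prediction. Once open words share the softmax, they take most of that mass, which is the separation between closed-set and open-set data that the paper reports.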

Paper Structure (28 sections, 10 equations, 6 figures, 14 tables)

Figures (6)

  • Figure 1: Small datasets with fewer classes: Current prompt tuning methods in (a) involve only known classes, causing the label bias in (b), where the distributions of the closed-set maximum probability for open-set and closed-set data overlap significantly. M-Tuning in (c) introduces open words to extend the range of words forming the prompt texts. The results after mitigating the label bias in (d) show that closed-set and open-set data are clearly separated. Large datasets with more classes: In (f) and (g), 'IN' abbreviates ImageNet. Prompts shown in different colors are mutually independent. When M-Tuning is applied directly to large datasets as in (c), the closed-set accuracy in (f) is low. With the CTT strategy in (e), M-Tuning is performed on the decomposed groups with fewer classes, yielding the higher closed-set accuracy in (g).
  • Figure 2: Left. The framework of the proposed M-Tuning. To simulate open-set scenarios, we introduce open words from WordNet into prompt tuning. The open words are unrelated to the downstream training and testing classes. Each image can be assigned to either a closed-set class or an open word, rather than only to a closed-set class. Right. A comparison of the distributions of the maximum probabilities on closed-set classes shows that M-Tuning mitigates the label bias of prompt learning.
  • Figure 3: The framework of the proposed Combinatorial Tuning and Testing (CTT) strategy. The dataset is divided by category into several independent groups. Each group is assigned its own set of group-specific prompts. M-Tuning is performed on each group independently of the others. After tuning, each test image is predicted by all prompts. The optimal prompt for the final prediction is selected based on the group-wise closed-set predicted probabilities. As a special case, the number of groups is set to 1 for small-scale datasets.
  • Figure 4: Comparison of detailed recognition results measured by mF1-score. Ours(All) and Ours(Few) denote the results of our method under the all-data and 16-shot settings, respectively. The compared methods (CLIP+ZSL [radford2021learning], CoOp [zhou2022learning], CoCoOp [zhou2022conditional], ARPL [chen2021adversarial], CE+ [vaze2022open]) are all self-implemented to establish new baselines, since the literature lacks VL-based OSR baselines.
  • Figure 5: Distributions of the closed-set maximum probabilities predicted for closed-set and open-set data in the unknown detection experiments, comparing M-Tuning with and without open words.
  • ...and 1 more figure
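
The CTT procedure described in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's code: the round-robin grouping, the rule of picking the group with the largest closed-set probability, and the names `split_into_groups` and `ctt_predict` are all hypothetical, and the per-group probabilities are assumed to come from independently tuned group-specific prompts.

```python
def split_into_groups(class_names, num_groups):
    """Partition closed-set classes into disjoint groups (round-robin, an assumption)."""
    return [class_names[i::num_groups] for i in range(num_groups)]

def ctt_predict(group_outputs):
    """Pick the final prediction across independently tuned groups.

    group_outputs: list of (class_names, closed_probs) pairs, one per group.
    closed_probs are the probabilities that the group's prompt assigns to its
    own closed-set classes; the mass given to open words is excluded.
    Assumed selection rule: take the class with the largest closed-set
    probability over all groups.
    """
    best_name, best_prob = None, -1.0
    for class_names, closed_probs in group_outputs:
        for name, p in zip(class_names, closed_probs):
            if p > best_prob:
                best_name, best_prob = name, p
    return best_name, best_prob

# Hypothetical example with four classes and two groups.
groups = split_into_groups(["cat", "dog", "car", "bus"], 2)
# Made-up closed-set probabilities produced by each group's prompt for one image.
outputs = [(groups[0], [0.2, 0.1]), (groups[1], [0.7, 0.05])]
print(ctt_predict(outputs))
```

Setting `num_groups` to 1 reduces this to plain M-Tuning on the whole closed set, matching the special case mentioned for small-scale datasets.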