Logits DeConfusion with CLIP for Few-Shot Learning

Shuo Li, Fang Liu, Zehua Hao, Xinyi Wang, Lingling Li, Xu Liu, Puhua Chen, Wenping Ma

TL;DR

This work tackles inter-class confusion in CLIP-based few-shot learning by introducing Logits DeConfusion (LDC). It combines a Multi-level Adapter Fusion (MAF) module, which fuses image features from multiple levels, with an Inter-Class Deconfusion (ICD) module, which learns confusion patterns and subtracts them from the zero-shot logits. An Adaptive Logits Fusion (ALF) module then combines the MAF and ICD logits. The model is trained with three cross-entropy losses and two similarity losses, the latter intended to prevent over-deconfusion. It reports improvements across 11 classification benchmarks and robustness to out-of-distribution data, and ablations examine the contribution of each module. The results indicate that learning inter-class confusion patterns and fusing multi-level features both improve CLIP-based few-shot performance.

Abstract

With its powerful visual-language alignment capability, CLIP performs well in zero-shot and few-shot learning tasks. However, we found in experiments that CLIP's logits suffer from serious inter-class confusion problems in downstream tasks, and the ambiguity between categories seriously affects the accuracy. To address this challenge, we propose a novel method called Logits DeConfusion, which effectively learns and eliminates inter-class confusion in logits by combining our Multi-level Adapter Fusion (MAF) module with our Inter-Class Deconfusion (ICD) module. Our MAF extracts features from different levels and fuses them uniformly to enhance feature representation. Our ICD learnably eliminates inter-class confusion in logits with a residual structure. Experimental results show that our method can significantly improve the classification performance and alleviate the inter-class confusion problem. The code is available at https://github.com/LiShuo1001/LDC.
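The two logit-level steps summarized above (residual deconfusion, then adaptive fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: in the paper the confusion term and the weight $\alpha$ are produced by learned modules, whereas here they are fixed, hypothetical numbers, and the convex-combination form of the fusion is an assumption. The function names `deconfuse` and `fuse` are likewise made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deconfuse(zs_logits, confusion):
    """Residual deconfusion: subtract an estimated confusion term
    from the zero-shot logits (the term is learned in the paper,
    hard-coded here)."""
    return [s - c for s, c in zip(zs_logits, confusion)]


def fuse(s_maf, s_icd, alpha):
    """Fuse two logit vectors with weight alpha.
    ASSUMPTION: a convex combination; the paper's exact rule may differ."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(s_maf, s_icd)]


# Hypothetical 3-class example: classes 0 and 1 are confused by zero-shot CLIP.
zs_logits = [2.0, 1.9, 0.1]
confusion = [0.0, 0.5, 0.0]   # hypothetical confusion estimate
s_icd = deconfuse(zs_logits, confusion)

s_maf = [2.2, 1.0, 0.2]       # hypothetical adapter-branch logits
fused = fuse(s_maf, s_icd, alpha=0.5)

print("zero-shot probs:", softmax(zs_logits))
print("deconfused probs:", softmax(s_icd))
print("fused probs:", softmax(fused))
```

In this toy case, subtracting the confusion term lowers the logit of the confusable class, so the probability assigned to class 0 rises relative to the raw zero-shot prediction.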

Paper Structure

This paper contains 25 sections, 14 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: (a) Inter-class confusion of logits in CLIP-based ZSL. (b) Logits after removing inter-class confusion. (c) Our Logits DeConfusion models inter-class confusion and removes it.
  • Figure 2: Overall architecture of our LDC. Our method consists of four main modules, namely Zero-Shot CLIP (ZS-CLIP), Inter-Class Deconfusion (ICD), Multi-level Adapter Fusion (MAF), and Adaptive Logits Fusion (ALF). In addition, our method includes three cross-entropy losses and two similarity losses for optimizing the learnable parameters. In ALF, the $\alpha$ Generator generates an adaptive weight $\alpha$ used to fuse the logits $s_i^{MAF}$ and $s_i^{ICD}$. All learnable parameters are in ICD, MAF, and $\alpha$ Generator. MAF is detailed in Section \ref{sec_maf}.
  • Figure 3: Details of MAF. On the left is the image encoder $\mathcal{E}_I$.
  • Figure 4: Details of our Fusion mechanisms.
  • Figure 5: Classification performance of different methods on 11 datasets; the last one shows the average performance over these 11 datasets.