Pseudo-Prompt Generating in Pre-trained Vision-Language Models for Multi-Label Medical Image Classification

Yaoqin Ye, Junjie Zhang, Hongwei Shi

TL;DR

Comparative evaluations on various multi-label chest radiograph datasets affirm the superiority of the Pseudo-Prompt Generating approach against leading medical vision-language and multi-label prompt learning methods.

Abstract

The task of medical image recognition is notably complicated by the presence of varied and multiple pathological indications, presenting a unique challenge in multi-label classification with unseen labels. This complexity underlines the need for computer-aided diagnosis methods employing multi-label zero-shot learning. Recent advancements in pre-trained vision-language models (VLMs) have showcased notable zero-shot classification abilities on medical images. However, these methods are limited in leveraging extensive pre-trained knowledge from broader image datasets, and often depend on manual prompt construction by expert radiologists. By automating the process of prompt tuning, prompt learning techniques have emerged as an efficient way to adapt VLMs to downstream tasks. Yet, existing CoOp-based strategies fall short in producing class-specific prompts for unseen categories, limiting generalizability in fine-grained scenarios. To overcome these constraints, we introduce a novel prompt generation approach inspired by text generation in natural language processing (NLP). Our method, named Pseudo-Prompt Generating (PsPG), capitalizes on the prior knowledge of multi-modal features. Featuring an RNN-based decoder, PsPG autoregressively generates class-tailored embedding vectors, i.e., pseudo-prompts. Comparative evaluations on various multi-label chest radiograph datasets affirm the superiority of our approach against leading medical vision-language and multi-label prompt learning methods. The source code is available at https://github.com/fallingnight/PsPG
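
The autoregressive generation of pseudo-prompts mentioned in the abstract can be illustrated with a toy sketch. This is not the PsPG implementation: the abstract only states that the decoder is RNN-based, so the GRU cell, the use of a per-class feature as the initial hidden state, the output projection, and every name below (`TinyGRUPromptDecoder`, `class_feature`, `length`) are assumptions made for illustration, with random untrained weights.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyGRUPromptDecoder:
    """Hypothetical toy decoder with random weights; not the authors' model."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # One weight matrix per gate, applied to the concatenation [input, hidden].
        self.W_z = rng.normal(0.0, scale, (2 * dim, dim))  # update gate
        self.W_r = rng.normal(0.0, scale, (2 * dim, dim))  # reset gate
        self.W_h = rng.normal(0.0, scale, (2 * dim, dim))  # candidate state
        self.W_out = rng.normal(0.0, scale, (dim, dim))    # hidden -> embedding
        self.start = np.zeros(dim)                         # assumed start input

    def step(self, x, h):
        """One GRU update for a batch of classes."""
        xh = np.concatenate([x, h], axis=-1)
        z = sigmoid(xh @ self.W_z)
        r = sigmoid(xh @ self.W_r)
        cand = np.tanh(np.concatenate([x, r * h], axis=-1) @ self.W_h)
        return (1.0 - z) * h + z * cand

    def generate(self, class_feature, length):
        """Generate `length` embedding vectors per class, one step at a time.

        class_feature: array of shape (num_classes, dim), assumed here to
        seed the hidden state so that each class gets its own prompt.
        """
        h = class_feature
        x = np.broadcast_to(self.start, h.shape)
        tokens = []
        for _ in range(length):
            h = self.step(x, h)
            x = h @ self.W_out  # the emitted vector is fed back as next input
            tokens.append(x)
        return np.stack(tokens, axis=1)  # (num_classes, length, dim)


decoder = TinyGRUPromptDecoder(dim=8)
class_features = np.random.default_rng(1).normal(size=(3, 8))
pseudo_prompts = decoder.generate(class_features, length=4)
print(pseudo_prompts.shape)  # (3, 4, 8)
```

Because each emitted vector becomes the next input, the sequence is produced step by step rather than learned as a fixed shared prefix, and different class features yield different sequences, including for classes not seen when the weights were fit.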

Paper Structure (42 sections, 9 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Comparative overview of prompt construction methods for vision-language models. (a) Contrastive pretraining methods, using expert-crafted prompts (GLoRIA, ConVIRT); (b) CoOp-based methods with unified learnable prefixes (DualCoOp, CoOp); (c) Our Pseudo-Prompt Generating, employing a decoder for dynamic pseudo-text generation.
  • Figure 2: Overview of the prompt learning phase of PsPG. We introduce three novel components: Prompt Decoder, Spatial Fusion, and Soft Pairwise Co-occurrence Loss.
  • Figure 3: The autoregressive process and the internal architecture of Prompt Decoder.
  • Figure 4: Detailed statistics of our Private-CXR dataset. The subfigure in the upper left corner lists the name of each category.