KPL: Training-Free Medical Knowledge Mining of Vision-Language Models

Jiaxiang Liu; Tianxiang Hu; Jiawei Du; Ruiyuan Zhang; Joey Tianyi Zhou; Zuozhu Liu

TL;DR

Knowledge Proxy Learning (KPL) leverages CLIP's multimodal understanding for medical image classification through Text Proxy Optimization and Multimodal Proxy Learning, enabling effective zero-shot image classification that outperforms all baselines.

Abstract

Vision-language models such as CLIP excel at image recognition thanks to extensive image-text pre-training. However, applying CLIP inference to zero-shot classification, particularly for medical image diagnosis, faces two challenges: 1) single category names are inadequate for representing image classes; 2) there is a modality gap between the visual and text spaces produced by the CLIP encoders. Despite attempts to enrich disease descriptions with large language models, the lack of class-specific knowledge often leads to poor performance. In addition, empirical evidence suggests that existing proxy learning methods for zero-shot image classification on natural image datasets become unstable when applied to medical datasets. To tackle these challenges, we introduce Knowledge Proxy Learning (KPL) to mine knowledge from CLIP. KPL is designed to leverage CLIP's multimodal understanding for medical image classification through Text Proxy Optimization and Multimodal Proxy Learning. Specifically, KPL retrieves image-relevant knowledge descriptions from a constructed knowledge-enhanced base to enrich the semantic text proxies. It then uses the input images and these descriptions, both encoded by CLIP, to stably generate multimodal proxies that boost zero-shot classification performance. Extensive experiments on both medical and natural image datasets demonstrate that KPL enables effective zero-shot image classification, outperforming all baselines. These findings highlight the potential of mining knowledge from CLIP for medical image classification and broader areas.
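To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares zero-shot prediction from category-name text proxies with prediction from proxies enriched by retrieved descriptions. The function names (`zero_shot_predict`, `semantic_centers`), the `top_k` parameter, and the retrieval rule (keep the descriptions closest to the mean image feature, then average them) are all assumptions made for illustration. Random vectors stand in for CLIP embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def zero_shot_predict(image_feats, text_proxies):
    """Assign each image to the class whose proxy is most cosine-similar."""
    sims = l2_normalize(image_feats) @ l2_normalize(text_proxies).T
    return sims.argmax(axis=1)

def semantic_centers(image_feats, desc_feats_per_class, top_k=3):
    """Hypothetical visual-based retrieval: for each class, keep the top_k
    descriptions closest to the mean image feature and average them."""
    query = l2_normalize(l2_normalize(image_feats).mean(axis=0))
    centers = []
    for desc_feats in desc_feats_per_class:
        desc = l2_normalize(desc_feats)
        scores = desc @ query                      # relevance to the images
        keep = np.argsort(scores)[::-1][:top_k]    # indices of best matches
        centers.append(desc[keep].mean(axis=0))
    return l2_normalize(np.stack(centers))

# Synthetic stand-ins for encoder outputs (assumption: 16-dim, 2 classes).
rng = np.random.default_rng(0)
dim, n_images = 16, 8
image_feats = rng.normal(size=(n_images, dim))
name_proxies = rng.normal(size=(2, dim))           # one embedding per class name
desc_feats_per_class = [rng.normal(size=(5, dim)),
                        rng.normal(size=(7, dim))]  # several descriptions each

baseline_pred = zero_shot_predict(image_feats, name_proxies)
centers = semantic_centers(image_feats, desc_feats_per_class, top_k=3)
enriched_pred = zero_shot_predict(image_feats, centers)
```

With real data, `image_feats`, `name_proxies`, and the description embeddings would come from the CLIP image and text encoders. The sketch covers only the text-proxy side and does not model Multimodal Proxy Learning.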

Paper Structure (12 sections, 1 theorem, 9 equations, 6 figures, 4 tables)

Introduction
Preliminaries and Related Work
Methodology
Overview
Text Proxy Optimization
Multimodal Proxy Learning
Experiments
Experimental Setup
Main Results
Ablation Studies
Conclusion
Acknowledgments

Key Result

Proposition 1

The optimization problem in Eq. 4 has a unique solution of the form $diag(u) \, A \, diag(v)$, where $diag(u)$ and $diag(v)$ are diagonal matrices whose diagonals are the vectors $u$ and $v$, and $A = e^{\frac{M}{\tau}}$. See the proof in the Appendix.
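A solution built from two diagonal scalings of $A = e^{\frac{M}{\tau}}$ is the form that Sinkhorn-style iterations produce for entropy-regularized assignment problems. Eq. 4 is not reproduced in this summary, so the sketch below rests on assumptions: $M$ is an image-by-class similarity matrix, and the row and column marginals are uniform. The function name `sinkhorn` and its parameters are hypothetical. It alternately rescales rows and columns of $A$ until both marginals are matched.

```python
import numpy as np

def sinkhorn(M, tau=0.1, n_iter=500):
    """Alternating row/column scaling of A = exp(M / tau).

    Assumptions (not taken from the paper): uniform row marginals r and
    uniform column marginals c. Returns P = diag(u) @ A @ diag(v).
    """
    n, k = M.shape
    r = np.full(n, 1.0 / n)        # assumed target row sums
    c = np.full(k, 1.0 / k)        # assumed target column sums
    A = np.exp(M / tau)
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iter):
        u = r / (A @ v)            # rescale rows to match r
        v = c / (A.T @ u)          # rescale columns to match c
    P = u[:, None] * A * v[None, :]   # same as diag(u) @ A @ diag(v)
    return P, u, v

# Synthetic similarity scores for 6 images and 3 classes.
rng = np.random.default_rng(0)
M = rng.uniform(-1, 1, size=(6, 3))
P, u, v = sinkhorn(M, tau=0.5)
```

Each row of `P`, renormalized to sum to one, can be read as a soft class assignment for one image. A smaller `tau` sharpens the assignments but makes `exp(M / tau)` more prone to overflow, so a log-domain variant may be preferable in that regime.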

Figures (6)

Figure 1: VCD utilizes a limited number of descriptions that may be irrelevant to the images, whereas KPL generates richer descriptions from the knowledge of LLMs. These are visually-based and retrieved to ensure the descriptions' relevance to the images (Text Proxy Optimization).
Figure 2: KPL is designed to leverage CLIP's capabilities for medical image classification through Text Proxy Optimization and Multimodal Proxy Learning. First, semantic information is enriched for the cataract category through Text Proxy Optimization. Then, Multimodal Proxy Learning generates a multimodal proxy combining textual semantics and image knowledge to guide classification.
Figure 3: Overview of KPL. 1) Category texts (e.g., cataract) pass through a Knowledge-Enhanced Base and are encoded by a text encoder to produce semantic embedding centers via visual-based retrieval. Visual images are processed through a visual encoder to obtain visual features. 2) CLIP uses the category name embeddings as text proxies to compare with visual features for classification results. 3) KPL uses semantic embedding centers as text proxies to direct Multimodal Proxy Learning. Final classification results are obtained via the generated multimodal proxies and visual features.
Figure 4: PCA Visualization on the IDRiD with CLIP: (a) CLIP using text proxies for feature visualization shows that the features of "no apparent retinopathy" are very dispersed and difficult to distinguish. (b) Using Sinkhorn for Multimodal Proxy Learning, the features of "no apparent retinopathy" are slightly more clustered. (c) Using the SG for Multimodal Proxy Learning, the features of "no apparent retinopathy" are tightly clustered into a distinct group.
Figure 5: Ablation studies with the KEB show that Multimodal Proxy Learning with KPL descriptions (KPL$_{D}$) outperforms VCD (VCD$_{D}$), highlighting KEB's effectiveness.
...and 1 more figure

Theorems & Definitions (1)

Proposition 1
