CLIP-DR: Textual Knowledge-Guided Diabetic Retinopathy Grading with Ranking-aware Prompting

Qinkai Yu, Jianyang Xie, Anh Nguyen, He Zhao, Jiong Zhang, Huazhu Fu, Yitian Zhao, Yalin Zheng, Yanda Meng

TL;DR

Diabetic retinopathy (DR) grading from color fundus images suffers from large appearance variability and long-tailed class distributions. CLIP-DR is a CLIP-based framework that treats DR grading as image-text matching. It adds ranking-aware prompting to encode the natural ordinal order of the severity grades, and a Similarity Matrix Smooth (SMS) module to mitigate data imbalance. Training combines a main loss based on KL divergence with a rank loss that enforces the order between neighboring classes. The authors report state-of-the-art performance on GDRBench across six DR datasets, together with strong generalization. By using ordinal information and calibrated text-image alignment, CLIP-DR is more robust to visual variability and class imbalance, which makes it a practical candidate for scalable, generalizable DR grading in clinical settings.
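The image-text matching view can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the 2-D "features" built by `unit`, the temperature of 0.07, and the one-hot target are all assumptions chosen for illustration. It only shows how a similarity row over five grade prompts can be turned into a distribution and compared with a target through KL divergence.

```python
import math

def unit(angle_deg):
    """Toy 2-D unit vector; a hypothetical stand-in for an encoder output."""
    a = math.radians(angle_deg)
    return [math.cos(a), math.sin(a)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with zero target mass contribute nothing."""
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(target, predicted) if p > 0)

# Five text features, one per DR grade (assumed toy values on an arc).
text_feats = [unit(a) for a in (0.0, 22.5, 45.0, 67.5, 90.0)]
# One image feature that lies closest to the grade-1 text feature.
image_feat = unit(20.0)

temperature = 0.07  # assumed value for this sketch
logits = [sum(x * t for x, t in zip(image_feat, tf)) / temperature
          for tf in text_feats]
probs = softmax(logits)

target = [0.0, 1.0, 0.0, 0.0, 0.0]  # one-hot target for grade 1 (assumption)
loss = kl_divergence(target, probs)
predicted_grade = max(range(len(probs)), key=probs.__getitem__)
```

With a one-hot target the KL term reduces to the negative log-probability of the true grade; a softer target distribution would spread the penalty over neighboring grades.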

Abstract

Diabetic retinopathy (DR) is a complication of diabetes and usually takes decades to reach sight-threatening levels. Accurate and robust detection of DR severity is critical for the timely management and treatment of diabetes. However, most current DR grading methods suffer from insufficient robustness to data variability (e.g. colour fundus images), posing a significant difficulty for accurate and robust grading. In this work, we propose a novel DR grading framework CLIP-DR based on three observations: 1) Recent pre-trained visual language models, such as CLIP, showcase a notable capacity for generalisation across various downstream tasks, serving as effective baseline models. 2) The grading of image-text pairs for DR often adheres to a discernible natural sequence, yet most existing DR grading methods have primarily overlooked this aspect. 3) A long-tailed distribution among DR severity levels complicates the grading process. This work proposes a novel ranking-aware prompting strategy to help the CLIP model exploit the ordinal information. Specifically, we sequentially design learnable prompts between neighbouring text-image pairs in two different ranking directions. Additionally, we introduce a Similarity Matrix Smooth module into the structure of CLIP to balance the class distribution. Finally, we perform extensive comparisons with several state-of-the-art methods on the GDRBench benchmark, demonstrating our CLIP-DR's robustness and superior performance. The implementation code is available at https://github.com/Qinkaiyu/CLIP-DR.
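The idea of using ranking information in two directions can be made concrete with a hedged sketch. The function below is a hinge-style stand-in, not the paper's rank loss: `rank_violation` and its `margin` parameter are hypothetical names. It penalizes a similarity row whenever the score increases while moving away from the true grade, checked separately to the left and to the right.

```python
def rank_violation(scores, true_idx, margin=0.0):
    """Sum of hinge penalties for a similarity row that is not ordered
    around the true grade (illustrative stand-in, not the paper's loss)."""
    penalty = 0.0
    # Left direction: stepping from the true grade towards grade 0,
    # each farther grade should not score higher than its nearer neighbour.
    for j in range(true_idx, 0, -1):
        penalty += max(0.0, scores[j - 1] - scores[j] + margin)
    # Right direction: same check towards the most severe grade.
    for j in range(true_idx, len(scores) - 1):
        penalty += max(0.0, scores[j + 1] - scores[j] + margin)
    return penalty

# A row that peaks at the true grade (index 1) and decays on both sides.
ordered = rank_violation([0.2, 0.9, 0.6, 0.4, 0.1], true_idx=1)
# A row where grade 3 scores higher than grade 2, breaking the right-hand order.
disordered = rank_violation([0.2, 0.9, 0.3, 0.7, 0.1], true_idx=1)
```

Here `ordered` is zero because the row decreases monotonically away from the true grade, while `disordered` is positive because of the single out-of-order neighbouring pair.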

Paper Structure (9 sections, 9 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: An example of learnable rank-aware prompting with image class 'Mild' for a given image $I_{2}$. $[C_{1},\dots,C_{5}]$ represent the 5 different DR grading classes. The similarity score is obtained by the inner product of the image feature and the text feature. Designing learnable rank-aware prompts that satisfy the following two inequalities enables the model to learn natural order information.
  • Figure 1: Class Activation Maps for CLIP-DR and OrdinalCLIP. The top row shows the original images, the second row shows CLIP-DR (Ours), and the last row shows OrdinalCLIP. CLIP-DR and OrdinalCLIP use the same training data (DG test, 'APTOS' as target) and are trained for 100 epochs. Differences between the class activation maps are highlighted with blue boxes.
  • Figure 2: Overview of the proposed CLIP-DR framework for training and inference. Images are processed through an image encoder to extract image features $X$. The corresponding text labels are fed into the text encoder, generating label text embeddings $T$. The similarity matrix $\mathcal{S}$ is obtained through the inner product. Finally, the SMS module converts $\mathcal{S}$ into calibration features $\tilde{\mathcal{S}}$ with the same dimensions. The learnable rank-aware prompt strategy is implemented explicitly by $\mathcal{L}_{rank}$, which uses ranking information independently in the left and right directions, and $\mathcal{L}_{main}$ follows the practice of CLIP (Radford et al., 2021).
  • Figure 3: The image-text similarity matrix obtained by the inner product of the image features and text features: $X \cdot T^{\top}$. This matrix is an intuitive representation of the rank between different image-text pairs. The X-axis represents the five text labels, and the Y-axis represents real images of different classes. We average the results of the six sub-datasets and present them in this figure for our CLIP-DR and OrdinalCLIP (Li et al., 2022). CLIP-DR can learn rank-aware text-image features.
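
A class-by-label similarity matrix of the kind described for Figure 3 could be assembled roughly as follows. This is a minimal sketch under stated assumptions: `class_averaged_similarity` is a hypothetical helper, the features are toy 2-D lists rather than encoder outputs, and the averaging here is over images of one dataset only (the caption additionally averages over six sub-datasets).

```python
def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def class_averaged_similarity(image_feats, labels, text_feats, num_classes):
    """Average the image-text inner products over all images of each true class.

    Returns a num_classes x len(text_feats) matrix; a class with no images
    gets a row of None (hypothetical convention for this sketch).
    """
    sums = [[0.0] * len(text_feats) for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, label in zip(image_feats, labels):
        for j, text in enumerate(text_feats):
            sums[label][j] += dot(feat, text)
        counts[label] += 1
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(num_classes)
    ]

# Toy example with two classes and two text labels.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [0, 1, 0]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
matrix = class_averaged_similarity(image_feats, labels, text_feats, num_classes=2)
```

Each row of `matrix` corresponds to a true class and each column to a text label, so a rank-aware model would be expected to show rows that peak on the diagonal and decay towards more distant grades.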