OFF-CLIP: Improving Normal Detection Confidence in Radiology CLIP with Simple Off-Diagonal Term Auto-Adjustment

Junhyun Park, Chanyu Moon, Donghwan Lee, Kyungsu Kim, Minho Hwang

TL;DR

OFF-CLIP tackles two core problems in radiology CLIP: poor normal sample clustering leading to false positives and misalignment from normal text in abnormal reports causing false negatives. It introduces an off-diagonal term loss to reinforce normal clustering and an abnormal InfoNCE loss to preserve abnormal discrimination, supplemented by GPT-4o-based text prompting and sentence-level filtering to reduce misalignment. Across multiple chest X-ray datasets, OFF-CLIP yields substantial gains in normal detection while preserving or improving abnormal detection, and it enhances zero-shot grounding for anomaly localization. The approach is architecture-agnostic and offers practical benefits for medical vision-language systems by improving screening reliability and localization accuracy.

Abstract

Contrastive Language-Image Pre-Training (CLIP) has enabled zero-shot classification in radiology, reducing reliance on manual annotations. However, conventional contrastive learning struggles with normal case detection because its strict intra-sample alignment disrupts normal sample clustering, leading to high rates of false positives (FPs) and false negatives (FNs). To address these issues, we propose OFF-CLIP, a contrastive learning refinement that improves normal detection in two ways: an off-diagonal term loss enhances normal sample clustering, and sentence-level text filtering mitigates FNs by removing misaligned normal statements from abnormal reports. OFF-CLIP can be applied to radiology CLIP models without any architectural modifications. Experimental results show that OFF-CLIP significantly improves normal classification, achieving a 0.61 area under the curve (AUC) increase on VinDr-CXR over CARZero, the state-of-the-art zero-shot classification baseline, while maintaining or improving abnormal classification performance. Additionally, OFF-CLIP enhances zero-shot grounding by improving pointing game accuracy, confirming better anomaly localization. These results demonstrate OFF-CLIP's effectiveness as a robust and efficient enhancement for medical vision-language models.

Paper Structure

This paper contains 18 sections, 5 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The figure illustrates two key issues in (a) conventional diagonal InfoNCE loss in Radiology CLIP: (b) High false positives due to alignment restricted to matched pairs, forcing apart other normal samples, and (c) High false negatives caused by normal sentences in abnormal reports, bringing normal sentences closer to abnormal images while pushing them away from normal images.
  • Figure 2: OFF-CLIP leverages an off-diagonal term loss to effectively cluster normal samples within a batch. Abnormal pairs are further refined using an abnormal-only InfoNCE loss. Reports are processed using an LLM for text prompting, and sentence-level anomaly classification is applied to label each sentence. Normal sentences in abnormal reports are then filtered to reduce misalignment.
  • Figure 3: Visualization of attention maps on VinDr-CXR. Red boxes indicate the ground truth bounding boxes for each disease. Highlighted pixels represent regions with higher activation weights, linking specific words to image areas.
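
The two loss terms described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the text here does not give the exact formulas, so the sketch assumes that the off-diagonal term is a binary cross-entropy against a target matrix in which every normal image matches every normal text, and that the abnormal-only term is a symmetric InfoNCE restricted to abnormal pairs. The function names (`build_targets`, `off_diagonal_loss`, `abnormal_info_nce`) and the `is_normal` flag list are hypothetical, and only the Python standard library is used.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def build_targets(is_normal):
    # Assumed target matrix: 1 on the diagonal (matched pairs) and for every
    # normal-image / normal-text pair in the batch, 0 elsewhere.
    n = len(is_normal)
    return [
        [1.0 if i == j or (is_normal[i] and is_normal[j]) else 0.0 for j in range(n)]
        for i in range(n)
    ]


def off_diagonal_loss(logits, is_normal, eps=1e-12):
    # Mean binary cross-entropy between sigmoid(logits) and the targets.
    # logits[i][j] is the similarity of image i and text j.
    targets = build_targets(is_normal)
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = sigmoid(logits[i][j])
            t = targets[i][j]
            total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / (n * n)


def abnormal_info_nce(logits, is_normal):
    # Symmetric InfoNCE computed only over abnormal samples, so normal
    # samples never act as negatives for each other in this term.
    idx = [i for i, normal in enumerate(is_normal) if not normal]
    if not idx:
        return 0.0

    def directional(image_to_text):
        loss = 0.0
        for i in idx:
            row = [logits[i][j] if image_to_text else logits[j][i] for j in idx]
            m = max(row)
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            loss += log_z - logits[i][i]  # cross-entropy with the matched pair
        return loss / len(idx)

    return 0.5 * (directional(True) + directional(False))


# Toy batch: two normal and two abnormal image-report pairs.
is_normal = [True, True, False, False]
clustered = [[5.0 if t else -5.0 for t in row] for row in build_targets(is_normal)]
diag_only = [[5.0 if i == j else -5.0 for j in range(4)] for i in range(4)]
```

In this toy batch, `clustered` scores all normal-normal pairs highly, while `diag_only` scores only matched pairs highly, as a diagonal-only InfoNCE objective would encourage. Under the assumed target matrix, `off_diagonal_loss` is lower for `clustered`, which is the clustering behavior the caption describes; how the two terms are weighted and combined is left open here.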