OFF-CLIP: Improving Normal Detection Confidence in Radiology CLIP with Simple Off-Diagonal Term Auto-Adjustment
Junhyun Park, Chanyu Moon, Donghwan Lee, Kyungsu Kim, Minho Hwang
TL;DR
OFF-CLIP tackles two core problems in radiology CLIP: poor normal sample clustering leading to false positives and misalignment from normal text in abnormal reports causing false negatives. It introduces an off-diagonal term loss to reinforce normal clustering and an abnormal InfoNCE loss to preserve abnormal discrimination, supplemented by GPT-4o-based text prompting and sentence-level filtering to reduce misalignment. Across multiple chest X-ray datasets, OFF-CLIP yields substantial gains in normal detection while preserving or improving abnormal detection, and it enhances zero-shot grounding for anomaly localization. The approach is architecture-agnostic and offers practical benefits for medical vision-language systems by improving screening reliability and localization accuracy.
Abstract
Contrastive Language-Image Pre-Training (CLIP) has enabled zero-shot classification in radiology, reducing reliance on manual annotations. However, conventional contrastive learning struggles with normal case detection: its strict intra-sample alignment disrupts normal sample clustering, leading to high rates of false positives (FPs) and false negatives (FNs). To address these issues, we propose OFF-CLIP, a contrastive learning refinement that improves normal detection in two ways. It introduces an off-diagonal term loss to enhance normal sample clustering, and it applies sentence-level text filtering to mitigate FNs by removing misaligned normal statements from abnormal reports. OFF-CLIP can be applied to radiology CLIP models without any architectural modifications. Experimental results show that OFF-CLIP significantly improves normal classification, achieving a 0.61 increase in area under the curve (AUC) on VinDr-CXR over CARZero, the state-of-the-art zero-shot classification baseline, while maintaining or improving abnormal classification performance. Additionally, OFF-CLIP enhances zero-shot grounding by improving pointing game accuracy, confirming better anomaly localization. These results demonstrate OFF-CLIP's effectiveness as a robust and efficient enhancement for medical vision-language models.
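To make the off-diagonal idea concrete, here is a minimal sketch of one way such a loss could look. It is not the authors' implementation. It assumes that every normal-normal image-text pair in a batch is treated as a positive alongside the usual diagonal pairs, and that the similarity logits are scored with binary cross-entropy. The function names, the `is_normal` flags, and the choice of binary cross-entropy are illustrative assumptions; the abnormal InfoNCE term and the text filtering are omitted.

```python
import math


def off_diagonal_targets(is_normal):
    """Build the target matrix: 1 on the diagonal and for every
    normal-normal pair, 0 elsewhere (assumed labeling scheme)."""
    n = len(is_normal)
    return [
        [1.0 if i == j or (is_normal[i] and is_normal[j]) else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def off_diagonal_loss(sim, is_normal):
    """Mean binary cross-entropy between image-text similarity logits
    `sim` (an n x n list of lists) and the off-diagonal targets."""
    targets = off_diagonal_targets(is_normal)
    n = len(sim)
    total = sum(
        bce_with_logits(sim[i][j], targets[i][j])
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)


if __name__ == "__main__":
    # Batch of three studies: the first and third are normal.
    flags = [True, False, True]
    # Similarities that reward only the diagonal, as a standard contrastive target would.
    diagonal_only = [[5.0 if i == j else -5.0 for j in range(3)] for i in range(3)]
    print(off_diagonal_loss(diagonal_only, flags))
```

Under these assumptions, a similarity matrix that pulls only matched pairs together is penalized for pushing the two normal studies apart, which is the clustering behavior the abstract attributes to the off-diagonal term.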
