Benchmarking Pretrained Vision Embeddings for Near- and Duplicate Detection in Medical Images

Tuan Truong, Farnaz Khun Jush, Matthias Lenga

TL;DR

The paper tackles the detection of near-duplicate and duplicate images in 3D medical imaging to mitigate the evaluation bias caused by data leakage. It introduces a 3D-volume matching framework that extracts slice-wise embeddings from 2D pretrained models (DINOv1 and DINOv2), aggregates the slice-level retrievals by majority count, and selects an optimal threshold $t_{\textrm{opt}}$ via Youden's index. Key contributions include a robust count-based aggregation method, an empirical benchmark on the Medical Segmentation Decathlon (MSD) dataset achieving a mean sensitivity of 0.9645 and a mean specificity of 0.8559, and a demonstration that natural-image embeddings transfer to medical tasks without fine-tuning. The approach also identifies potential (near-)duplicates in MSD, with practical implications for data curation and fair evaluation in medical imaging pipelines.
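
As a rough illustration of how a threshold such as $t_{\textrm{opt}}$ can be chosen with Youden's index, the sketch below scans candidate thresholds and keeps the one that maximizes $J = \text{sensitivity} + \text{specificity} - 1$. The function name `youden_threshold`, the convention that a query is flagged as a duplicate when its score is at or above the threshold, and the toy scores are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def youden_threshold(scores, is_duplicate):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    Hypothetical helper: a query counts as a predicted duplicate
    when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_duplicate, dtype=bool)
    if labels.all() or not labels.any():
        raise ValueError("need at least one duplicate and one non-duplicate")

    best_t, best_j = None, -np.inf
    # Every distinct score is a candidate threshold.
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & labels)
        fn = np.sum(~pred & labels)
        tn = np.sum(~pred & ~labels)
        fp = np.sum(pred & ~labels)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_t, best_j = float(t), float(j)
    return best_t, best_j

# Toy data (made up): per-query scores and ground-truth duplicate flags.
toy_scores = [0.9, 0.8, 0.75, 0.4, 0.35, 0.2]
toy_labels = [1, 1, 1, 0, 1, 0]
t_opt, j_opt = youden_threshold(toy_scores, toy_labels)
print(t_opt, j_opt)
```

Maximizing $J$ is equivalent to maximizing the sum of sensitivity and specificity, since the two differ only by the constant 1.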

Abstract

Near- and duplicate image detection is a critical concern in the field of medical imaging. Medical datasets often contain similar or duplicate images from various sources, which can lead to significant performance issues and evaluation biases, especially in machine learning tasks due to data leakage between training and testing subsets. In this paper, we present an approach for identifying near- and duplicate 3D medical images leveraging publicly available 2D computer vision embeddings. We assessed our approach by comparing embeddings extracted from two state-of-the-art self-supervised pretrained models and two different vector index structures for similarity retrieval. We generate an experimental benchmark based on the publicly available Medical Segmentation Decathlon dataset. The proposed method yields promising results for near- and duplicate image detection achieving a mean sensitivity and specificity of 0.9645 and 0.8559, respectively.

Paper Structure

This paper contains 15 sections, 2 equations, 4 figures, 9 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Retrieval at case level based on count accumulation
  • Figure 2: Near-duplicate images generated by geometric and intensity transformations at different strengths. For JPEG compression, a region of interest is shown to display the quality degradation, which is not visible to the human eye when looking at the whole image.
  • Figure 3: Case-level normalized counts of top-1 and top-3 predictions indexed with HNSW using DINOv1 embeddings. Green bars denote non-duplicate queries; gray and red bars denote duplicate queries. Red bars mark duplicate queries whose top-1 prediction is in the database but does not match the ground-truth image case ID. The blue horizontal bar is the threshold that yields the maximum sum of sensitivity and specificity.
  • Figure B.1: Example of a near-duplicate found in the MSD dataset. BRATS_404 and BRATS_082 form a near-duplicate pair that differs only in brightness.
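
The case-level count accumulation referred to in Figures 1 and 3 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses brute-force cosine similarity in NumPy as a stand-in for a vector index such as HNSW, normalizes counts by the number of query slices, and counts each case at most once per query slice. The function name `case_level_counts` and the toy embeddings are hypothetical.

```python
import numpy as np
from collections import Counter

def case_level_counts(query_slices, db_embeddings, db_case_ids, top_k=1):
    """Accumulate per-case hit counts over all slices of one query volume.

    query_slices:  (n_slices, dim) slice embeddings of the query volume.
    db_embeddings: (n_db, dim) slice embeddings of all database volumes.
    db_case_ids:   length-n_db list with the case ID of each database slice.
    Returns [(case_id, normalized_count), ...] sorted by count, descending.
    """
    q = np.asarray(query_slices, dtype=float)
    d = np.asarray(db_embeddings, dtype=float)
    # L2-normalize so the dot product equals cosine similarity (assumed metric).
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = q @ d.T

    # Indices of the top_k most similar database slices for each query slice.
    nearest = np.argsort(-sims, axis=1)[:, :top_k]

    counts = Counter()
    for row in nearest:
        # A case receives at most one vote per query slice.
        for case_id in {db_case_ids[i] for i in row}:
            counts[case_id] += 1

    n_slices = q.shape[0]
    return [(cid, c / n_slices) for cid, c in counts.most_common()]

# Toy database (made up): two cases with three slice embeddings each.
db = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
               [1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=float)
ids = ["A", "A", "A", "B", "B", "B"]
# Query volume: a slightly perturbed copy of case "A".
query = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]])
ranked = case_level_counts(query, db, ids, top_k=1)
print(ranked)
```

In a setup like this, the top-ranked case's normalized count would then be compared against a decision threshold to decide whether the query is reported as a (near-)duplicate of that case.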