Detecting Media Clones in Cultural Repositories Using a Positive Unlabeled Learning Approach

V. Sevetlidis, V. Arampatzakis, M. Karta, I. Mourthos, D. Tsiafaki, G. Pavlidis

Abstract

We formulate curator-in-the-loop duplicate discovery in the AtticPOT repository as a Positive-Unlabeled (PU) learning problem. Given a single anchor per artefact, we train a lightweight per-query Clone Encoder on augmented views of the anchor and score the unlabeled repository with an interpretable threshold on the latent $\ell_2$ norm. The system proposes candidates for curator verification, uncovering cross-record duplicates that were not verified a priori. On CIFAR-10 we obtain F1=96.37 (AUROC=97.97); on AtticPOT we reach F1=90.79 (AUROC=98.99), improving F1 by +7.70 points over the best baseline (SVDD) under the same lightweight backbone. Qualitative "find-similar" panels show stable neighbourhoods across viewpoint and condition. The method avoids explicit negatives, offers a transparent operating point, and fits de-duplication, record linkage, and curator-in-the-loop workflows.
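
The scoring step described in the abstract, thresholding the norm of each latent vector, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `l2_norm` and `propose_candidates`, the toy latent vectors, and the threshold value are hypothetical, and the abstract does not say on which side of the threshold clones fall, so the direction is exposed as an assumed parameter (`clone_if_below`).

```python
import math
from typing import List, Sequence


def l2_norm(z: Sequence[float]) -> float:
    """Euclidean (l2) norm of one latent vector."""
    return math.sqrt(sum(x * x for x in z))


def propose_candidates(
    latents: Sequence[Sequence[float]],
    threshold: float,
    clone_if_below: bool = True,  # ASSUMPTION: direction is not given in the abstract
) -> List[int]:
    """Return indices of repository items flagged as clone candidates.

    Items are kept when their latent norm is on the clone side of the
    threshold, and are ordered from most to least clone-like so a curator
    can review the strongest candidates first.
    """
    norms = [l2_norm(z) for z in latents]
    if clone_if_below:
        keep = [i for i, n in enumerate(norms) if n <= threshold]
        return sorted(keep, key=lambda i: norms[i])
    keep = [i for i, n in enumerate(norms) if n >= threshold]
    return sorted(keep, key=lambda i: -norms[i])


# Hypothetical latent vectors, standing in for encoder outputs
# on four unlabeled repository images.
toy_latents = [[0.1, 0.0], [3.0, 4.0], [0.3, 0.4], [1.0, 1.0]]
candidates = propose_candidates(toy_latents, threshold=0.6)
```

Because the decision reduces to a single scalar comparison per image, the operating point can be inspected and adjusted directly, which is what makes the threshold interpretable for a curator-in-the-loop workflow.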

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: Proposed clone-detection workflow: an anchor is augmented into clones and contrasted with an unlabeled pool; a Positive–Unlabeled–trained encoder discards non-matches and flags near-duplicates.
  • Figure 2: Training batch illustration on CIFAR-10. Panel A: the anchor image; panel B: clones of the anchor generated by augmentation; panel C: the unlabeled pool, which includes unlabeled positives.
  • Figure 3: Duplicate detection performance on CIFAR-10 (left) and AtticPOT (right).
  • Figure 4: Qualitative retrieval on AtticPOT. For each query (leftmost image in a row), we show the nine highest-ranked images (green borders) and the single least similar image (rightmost, red border).
  • Figure 5: Positive vs. negative latent norm distributions for a sample anchor (left). Across anchors: distribution of $\mu$ (center) and learned $m$ (right). $\mu$ varies with the anchor.
  • ...and 1 more figure