Detecting Near-Duplicate Face Images
Sudipta Banerjee, Arun Ross
TL;DR
The paper addresses near-duplicate face image detection by modeling the relationships among near-duplicates with Image Phylogeny Trees (IPTs) and assembling these into Image Phylogeny Forests (IPFs). It combines a locally-scaled spectral clustering stage, which uses fused pixel, PRNU, and face-descriptor features to group near-duplicates, with a graph neural network-based node embedding and a PRNU-driven link-prediction stage to recover hierarchical parent-child relationships. The approach is shown to be robust across unseen transformations, modalities, configurations, and varying IPT sizes, improving over baselines by up to approximately 42% in IPF reconstruction accuracy and generalizing to other biometric and natural-scene images. The work offers a domain-agnostic framework for provenance analysis in digital imagery, with potential implications for biometric security and copyright enforcement. It also reports strong performance across demographic groups and diverse datasets.
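The summary above does not spell out how the locally-scaled affinity is computed. As a rough sketch, assuming the common self-tuning formulation in which each sample's scale is its distance to its k-th nearest neighbour, the affinity matrix could be built as follows. The function name `locally_scaled_affinity`, the parameter `k`, and the use of plain Euclidean distance on already-fused feature vectors are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def locally_scaled_affinity(features, k=7):
    """Build a locally-scaled affinity matrix (hypothetical helper).

    features: (n, d) array, one fused feature vector per image (assumed input).
    k: which nearest neighbour defines each sample's local scale (assumed).
    """
    x = np.asarray(features, dtype=float)
    n = x.shape[0]

    # Pairwise squared Euclidean distances between all feature vectors.
    diff = x[:, None, :] - x[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    dist = np.sqrt(sq_dist)

    # Local scale sigma_i: distance from sample i to its k-th nearest
    # neighbour (column 0 of the sorted row is the sample itself).
    kk = min(k, n - 1)
    sigma = np.sort(dist, axis=1)[:, kk]
    sigma = np.maximum(sigma, 1e-12)  # guard against division by zero

    # A_ij = exp(-d_ij^2 / (sigma_i * sigma_j)), with a zero diagonal.
    affinity = np.exp(-sq_dist / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(affinity, 0.0)
    return affinity
```

A matrix of this kind can be passed to a standard spectral clustering routine that accepts a precomputed affinity. The per-sample scale lets tight and loose groups of near-duplicates coexist without a single global bandwidth.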
Abstract
Near-duplicate images are often generated by repeatedly applying photometric and geometric transformations that produce imperceptible variants of the original image. Consequently, a deluge of near-duplicates can be circulated online, posing copyright infringement concerns. The concerns are more severe when biometric data is altered through such nuanced transformations. In this work, we address the challenge of near-duplicate detection in face images by, firstly, identifying the original image from a set of near-duplicates and, secondly, deducing the relationship between the original image and the near-duplicates. We construct a tree-like structure, called an Image Phylogeny Tree (IPT), using a graph-theoretic approach to estimate this relationship, i.e., to determine the sequence in which the images were generated. We further extend our method to create an ensemble of IPTs known as an Image Phylogeny Forest (IPF). We rigorously evaluate our method to demonstrate robustness across other modalities, unseen transformations produced by recent generative models, and IPT configurations, thereby improving on state-of-the-art IPF reconstruction accuracy by 42%.
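To make the IPT idea concrete, the following minimal sketch represents a tree as a list of directed (parent, child) edges, recovers the root (the presumed original image), and assigns each image its depth, which corresponds to its position in the generation sequence. The function name `build_ipt`, the edge-list input, and the image labels are hypothetical; the sketch only illustrates the data structure and says nothing about how the paper infers the edges.

```python
def build_ipt(edges):
    """Assemble a tree from (parent, child) edges (hypothetical helper).

    Returns (root, children, depth), where depth[node] is the number of
    transformation steps separating node from the root.
    """
    parent = {}
    children = {}
    nodes = set()
    for p, c in edges:
        if c in parent:
            raise ValueError(f"node {c!r} has more than one parent")
        parent[c] = p
        children.setdefault(p, []).append(c)
        nodes.update((p, c))

    # The root is the only node that never appears as a child.
    roots = [n for n in nodes if n not in parent]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    root = roots[0]

    # Walk down from the root, recording each node's depth.
    depth = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            depth[child] = depth[node] + 1
            stack.append(child)

    if len(depth) != len(nodes):
        raise ValueError("edges contain a cycle or a disconnected part")
    return root, children, depth


# Example: an original image with two direct variants, one of which
# was transformed again.
root, children, depth = build_ipt(
    [("orig", "n1"), ("orig", "n2"), ("n1", "n3")]
)
```

Under this representation, an IPF is simply a collection of such trees, one per group of near-duplicates.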
