Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets
Noam Glazner, Noam Tsfaty, Sharon Shalev, Avishai Weizman
TL;DR
The paper addresses information leakage in video-derived frame datasets, where temporal and spatial correlations between frames inflate evaluation scores under random data splits. It introduces a cluster-based frame selection pipeline that converts frames to feature vectors, reduces their dimensionality with PaCMAP to obtain embeddings $\mathbf{z}_{k,i}$, and groups the embeddings with HDBSCAN so that each cluster is assigned in its entirety to a single data split. Key findings show that deep representations, particularly DINO-V3, yield the best clustering performance on ImageNet-VID and UCF101, with XFeat+VLAD also competitive, while traditional SIFT-based approaches can lag, especially on temporally variable data. The method is simple, scalable, and readily integrable into existing pipelines, offering fairer model evaluation for video-derived datasets and a path to quantifying leakage reduction across datasets.
Abstract
We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before splitting into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.
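To make the splitting step concrete, the following is a minimal sketch of assigning whole clusters to a single split. It is not the authors' implementation. It assumes cluster labels have already been computed (for example by HDBSCAN on the reduced embeddings), that the label -1 marks noise frames, and that each noise frame may be treated as its own singleton group. The function name `split_by_cluster`, the default ratios, and the greedy largest-deficit assignment are all illustrative assumptions.

```python
import random
from collections import defaultdict


def split_by_cluster(labels, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign frames to train/val/test so that no cluster spans two splits.

    labels: one cluster id per frame; -1 is treated as noise (assumption).
    ratios: target fractions for train, val, test; assumed to sum to 1.
    """
    names = ("train", "val", "test")

    # Group frame indices by cluster. Each noise frame becomes its own
    # singleton group (an assumption of this sketch, not the paper's rule).
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        key = ("noise", idx) if label == -1 else ("cluster", label)
        groups[key].append(idx)

    # Shuffle for randomness, then sort by size so large clusters are
    # placed first; the stable sort keeps the shuffled order among ties.
    rng = random.Random(seed)
    ordered = list(groups.values())
    rng.shuffle(ordered)
    ordered.sort(key=len, reverse=True)

    targets = [r * len(labels) for r in ratios]
    splits = {name: [] for name in names}
    for members in ordered:
        # Greedy choice: the split that is furthest below its target size.
        deficits = [targets[i] - len(splits[names[i]]) for i in range(len(names))]
        best = max(range(len(names)), key=lambda i: deficits[i])
        splits[names[best]].extend(members)
    return splits


if __name__ == "__main__":
    demo_labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, -1, 3, 3]
    for name, indices in split_by_cluster(demo_labels).items():
        print(name, sorted(indices))
```

Because clusters are never divided, the realized split sizes only approximate the requested ratios, and the approximation is coarser when a few clusters are very large.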
