Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets

Noam Glazner; Noam Tsfaty; Sharon Shalev; Avishai Weizman

TL;DR

The paper addresses information leakage in video-derived frame datasets: temporal and spatial correlations between frames inflate evaluation scores when the data is split at random. It introduces a cluster-based frame selection pipeline that converts frames to feature vectors, reduces them with PaCMAP to low-dimensional embeddings $\mathbf{z}_{k,i}$, and groups the embeddings with HDBSCAN so that each cluster is assigned in its entirety to a single data split. Deep representations, particularly DINO-V3, yield the best clustering performance on ImageNet-VID and UCF101, XFeat+VLAD is also competitive, and traditional SIFT-based approaches can lag, especially on temporally variable data. The method is simple, scalable, and readily integrable into existing pipelines, offering fairer model evaluation for video-derived datasets and a path to quantifying leakage reduction across datasets.

Abstract

We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before splitting into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.
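The splitting step can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes cluster labels have already been computed (for example by HDBSCAN, which marks noise points with -1) and assigns whole clusters to splits. The function name `split_by_cluster`, the 70/15/15 fractions, the greedy size-balancing rule, and the choice to treat each noise frame as its own group are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_cluster(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole clusters to train/val/test so no cluster spans two splits.

    labels: one cluster id per frame; -1 marks noise (HDBSCAN's convention).
    fractions: target share of frames per split (illustrative defaults).
    """
    # Group frame indices by cluster id. Assumption: every noise frame is
    # treated as its own singleton group, since it belongs to no cluster.
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        key = ("noise", idx) if lab == -1 else ("cluster", lab)
        groups[key].append(idx)

    # Shuffle reproducibly, then place the largest groups first so the
    # greedy balancing below has the most freedom (the sort is stable,
    # so groups of equal size keep their shuffled order).
    ordered = list(groups.values())
    random.Random(seed).shuffle(ordered)
    ordered.sort(key=len, reverse=True)

    names = ("train", "val", "test")
    targets = [f * len(labels) for f in fractions]
    splits = {name: [] for name in names}
    for members in ordered:
        # Send the whole group to the split furthest below its target size.
        deficits = [targets[i] - len(splits[names[i]]) for i in range(3)]
        best = deficits.index(max(deficits))
        splits[names[best]].extend(members)
    return splits


if __name__ == "__main__":
    demo_labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, -1, -1, 3]
    for name, idxs in split_by_cluster(demo_labels).items():
        print(name, sorted(idxs))
```

Because groups are never divided, the realized split sizes only approximate the requested fractions; with a few very large clusters the deviation can be substantial, which is the price of keeping near-duplicate frames on one side of the split.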
