Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets

Noam Glazner, Noam Tsfaty, Sharon Shalev, Avishai Weizman

TL;DR

The paper addresses information leakage in video-derived frame datasets: temporal and spatial correlations between frames inflate evaluation scores when the data is split at random. It introduces a cluster-based frame selection pipeline that converts frames to feature vectors, reduces them with PaCMAP to low-dimensional embeddings $\mathbf{z}_{k,i}$, and groups them with HDBSCAN so that each cluster is assigned in its entirety to a single data split. Key findings show that deep representations, particularly DINO-V3, yield the best clustering performance on ImageNet-VID and UCF101; XFeat+VLAD is also competitive, while traditional SIFT-based approaches can lag, especially on temporally variable data. The method is simple, scalable, and easy to integrate into existing pipelines. It offers fairer model evaluation for video-derived datasets and a path to quantifying leakage reduction across datasets.

Abstract

We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before splitting them into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.

Paper Structure

This paper contains 7 sections, 2 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Illustration of the proposed cluster-based frame selection pipeline. Each video is decomposed into individual frames, from which features are extracted using semantic representations (e.g., CLIP [Radford2021LearningTV]), handcrafted descriptors (e.g., HOG [dalal2005histograms]), or lightweight learned features (e.g., XFeat [potje2024xfeat]). Dimensionality reduction (e.g., PaCMAP [wang2021understanding]) and clustering (e.g., HDBSCAN [mcinnes2017hdbscan]) are then applied to group visually similar frames before dataset partitioning.
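
The final step of the pipeline, assigning whole clusters to a single split, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that cluster labels (one per frame, for example from HDBSCAN) are already available, that the label -1 marks noise frames, and that each noise frame is treated as its own singleton group. The function name `split_by_cluster`, the greedy assignment rule, and the default ratios are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def split_by_cluster(labels, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign frames to train/val/test so that no cluster spans two splits.

    labels: one cluster id per frame (e.g. the output of a clustering step).
    Assumption: label -1 means "noise"; each noise frame becomes its own
    singleton group, since noise frames are not known to be similar.
    """
    names = ("train", "val", "test")

    # Group frame indices by cluster id.
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        key = ("noise", idx) if lab == -1 else ("cluster", lab)
        groups[key].append(idx)

    # Shuffle for randomness, then place the largest groups first.
    # The sort is stable, so equal-sized groups keep their shuffled order.
    members = list(groups.values())
    random.Random(seed).shuffle(members)
    members.sort(key=len, reverse=True)

    # Greedy rule (an illustrative choice): give each group to the split
    # that is currently furthest below its target size.
    targets = [r * len(labels) for r in ratios]
    splits = {name: [] for name in names}
    for frames in members:
        deficits = [targets[j] - len(splits[names[j]]) for j in range(len(names))]
        best = max(range(len(names)), key=lambda j: deficits[j])
        splits[names[best]].extend(frames)
    return splits


if __name__ == "__main__":
    # Toy labels: four clusters (0..3) plus two noise frames.
    demo_labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, -1, -1, 3]
    for name, frames in split_by_cluster(demo_labels).items():
        print(name, sorted(frames))
```

Because groups are placed whole, the resulting split sizes only approximate the requested ratios, and the approximation is coarser when a few clusters are very large. Treating noise frames as singletons is one possible policy; dropping them or sending them all to the training split are alternatives that this sketch does not evaluate.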