Counteracting temporal attacks in Video Copy Detection

Katarzyna Fojcik, Piotr Syga

TL;DR

This paper addresses Video Copy Detection (VCD) under adversarial and large-scale conditions by reexamining the Dual-level approach from the META AI Challenge. It introduces a frame extraction method that selects local maxima of interframe differences after Hanning smoothing, reducing frame counts by up to 144× and giving roughly 2× faster inference while maintaining comparable $\mu$AP. The method is more robust to temporal attacks, losing only about 5–7% $\mu$AP under targeted perturbations, whereas the original Dual-level method degrades severely. These gains make VCD more practical and scalable in real-world, resource-constrained settings. Future directions include adaptive temporal alignment, cross-modal attacks, and interpretability.
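
The frame extraction idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the difference measure (mean absolute pixel difference), the choice of the later frame at each peak, and the function names `interframe_differences`, `smooth_hanning`, and `select_keyframes` are all assumptions made for illustration.

```python
import numpy as np

def interframe_differences(frames):
    """Mean absolute pixel difference between consecutive frames.

    Assumption: the summary does not state the difference measure, so mean
    absolute difference is used here. frames has shape (T, H, W) or
    (T, H, W, C); the result has length T - 1.
    """
    frames = np.asarray(frames, dtype=np.float32)
    diffs = np.abs(frames[1:] - frames[:-1])
    return diffs.reshape(len(diffs), -1).mean(axis=1)

def smooth_hanning(signal, window_size):
    """Smooth a 1-D signal with a normalized Hanning window."""
    if window_size < 3 or window_size > len(signal):
        raise ValueError("window_size must be in [3, len(signal)]")
    window = np.hanning(window_size)
    window /= window.sum()
    return np.convolve(signal, window, mode="same")

def select_keyframes(frames, window_size=30):
    """Return indices of frames at local maxima of the smoothed differences."""
    smoothed = smooth_hanning(interframe_differences(frames), window_size)
    interior = smoothed[1:-1]
    # A point is a peak if it rises above its left neighbour and does not
    # fall below its right neighbour (picks the first sample of a plateau).
    is_peak = (interior > smoothed[:-2]) & (interior >= smoothed[2:])
    peak_idx = np.flatnonzero(is_peak) + 1
    # Difference i sits between frames i and i + 1; keep the later frame
    # (an assumption, the earlier frame would be an equally valid choice).
    return peak_idx + 1

# Synthetic clip: three constant "scenes" with cuts at frames 100 and 200.
clip = np.zeros((300, 8, 8), dtype=np.float32)
clip[100:200] = 100.0
clip[200:] = 200.0
selected = select_keyframes(clip, window_size=31)
print(selected)
```

On this synthetic clip the only nonzero differences occur at the two cuts, so the selected frames coincide with the first frame of each new scene. Real footage produces a noisier curve, which is why the smoothing step matters.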

Abstract

Video Copy Detection (VCD) plays a crucial role in copyright protection and content verification by identifying duplicates and near-duplicates in large-scale video databases. The META AI Challenge on video copy detection provided a benchmark for evaluating state-of-the-art methods, with the Dual-level detection approach emerging as a winning solution. This method integrates Video Editing Detection (VED) and Frame Scene Detection to handle adversarial transformations and large datasets efficiently. However, our analysis reveals significant limitations in the VED component, particularly in its ability to handle exact copies. Moreover, Dual-level detection is vulnerable to temporal attacks. To address this, we propose an improved frame selection strategy based on local maxima of interframe differences, which enhances robustness against adversarial temporal modifications while significantly reducing computational overhead. Our method is 1.4 to 5.8 times more efficient than the standard 1 FPS approach. Compared to the Dual-level detection method, our approach maintains comparable micro-average precision ($\mu$AP) while also demonstrating improved robustness against temporal attacks. With a 56% smaller representation and inference more than 2 times faster, our approach is better suited to real-world resource constraints.
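
For readers unfamiliar with the metric, micro-average precision can be sketched as follows. This assumes the common definition used in copy-detection benchmarks: all predicted query-reference pairs are ranked together by confidence, and average precision is computed over that single global ranking. The function name and data layout are hypothetical, ties are broken by input order, and the official challenge evaluation may differ in such details.

```python
def micro_average_precision(predictions, ground_truth):
    """Average precision over one global ranking of all predicted pairs.

    predictions:  iterable of (query_id, reference_id, score), pairs unique
    ground_truth: set of (query_id, reference_id) pairs that are true copies
    """
    total_positives = len(ground_truth)
    if total_positives == 0:
        return 0.0
    # Rank every prediction from every query together, highest score first.
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (query_id, reference_id, _score) in enumerate(ranked, start=1):
        if (query_id, reference_id) in ground_truth:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    # Ground-truth pairs that were never predicted contribute zero.
    return precision_sum / total_positives

# Toy example: two true copies, one false match ranked between them.
toy_predictions = [("q1", "r1", 0.9), ("q2", "r9", 0.8), ("q2", "r2", 0.7)]
toy_truth = {("q1", "r1"), ("q2", "r2")}
toy_score = micro_average_precision(toy_predictions, toy_truth)
print(round(toy_score, 4))  # (1/1 + 2/3) / 2 = 0.8333
```

Because the ranking is global, a confident false match for one query lowers the score for all queries, which rewards well-calibrated confidence values across the whole dataset.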

Paper Structure (16 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Example of a video frame with applied transformations.
  • Figure 2: Interframe difference curve before and after smoothing with a Hanning window of size 30, with the selected frames of a sample video.
  • Figure 3: Interframe difference curve before and after smoothing with a Hanning window of size 50, with the selected frames of a sample video.
  • Figure 4: Interframe difference curve before and after smoothing with a Hanning window of size 100, with the selected frames of a sample video.
  • Figure 5: Selected frames from the first 10 seconds of a sample video obtained using different experimental methods.
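
Figures 2 to 4 compare Hanning windows of size 30, 50, and 100. As a rough illustration of why the window size matters, the sketch below counts local maxima of a smoothed signal for each size. The signal is synthetic random noise, not data from the paper, and the helper name `count_peaks_after_smoothing` is hypothetical.

```python
import numpy as np

def count_peaks_after_smoothing(signal, window_size):
    """Count local maxima of a signal after Hanning smoothing."""
    window = np.hanning(window_size)
    window /= window.sum()
    smoothed = np.convolve(signal, window, mode="same")
    interior = smoothed[1:-1]
    is_peak = (interior > smoothed[:-2]) & (interior >= smoothed[2:])
    return int(is_peak.sum())

# Synthetic stand-in for an interframe difference curve (assumption: noise only).
rng = np.random.default_rng(0)
synthetic_diffs = np.abs(rng.normal(size=6000))

peak_counts = {
    size: count_peaks_after_smoothing(synthetic_diffs, size)
    for size in (30, 50, 100)
}
print(peak_counts)
```

Wider windows suppress short-lived fluctuations, so fewer local maxima survive and fewer frames would be selected. The window size therefore acts as a trade-off between representation size and temporal detail.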