AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing

Niu Lian, Jun Li, Jinpeng Wang, Ruisheng Luo, Yaowei Wang, Shu-Tao Xia, Bin Chen

TL;DR

AutoSSVH tackles efficient unsupervised video retrieval by learning compact $K$-bit hash codes through adversarially guided frame sampling that prioritizes informative frames. A differentiable Grade-Net with Gumbel-Softmax TopK sampling selects hard frames, while a Transformer-based hashing network encodes them into hash codes; a Gradient Reversal Layer enables a single-stage adversarial training regime. The method introduces a Component Voting-based hash center and a Point-to-set (P2Set) hash contrastive objective to capture global semantics and neighborhood structure, reinforced by a Frame Reconstruction loss and a View Contrastive loss. Across ActivityNet, FCVID, UCF101, and HMDB51, AutoSSVH achieves state-of-the-art MAP and GMAP at multiple bit lengths, with notable improvements in cross-dataset generalization and training efficiency, demonstrating the practical impact of adversarial hard-frame sampling for video hashing.
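The Gumbel-Softmax TopK selection mentioned above can be illustrated with a minimal forward-pass sketch in plain Python. This is not the authors' Grade-Net: the function name `gumbel_topk`, the temperature `tau`, and the per-frame `scores` input are assumptions made for illustration, and no gradients are computed here. In an autograd framework, the soft weights would typically carry the gradient while the hard indices pick the frames.

```python
import math
import random

def gumbel_topk(scores, k, tau=1.0, rng=None):
    """Pick k frame indices by perturbing per-frame scores with Gumbel noise.

    Returns (sorted hard indices, softmax weights over all frames).
    `scores` stands in for the output of a frame-grading network (assumed).
    """
    rng = rng or random.Random()
    noisy = []
    for s in scores:
        # Clamp u away from 0 and 1 so the double logarithm stays finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((s + gumbel) / tau)

    # Hard selection: indices of the k largest perturbed scores.
    top = sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)[:k]

    # Soft relaxation: numerically stable softmax over the perturbed scores.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    weights = [e / z for e in exps]
    return sorted(top), weights

if __name__ == "__main__":
    indices, weights = gumbel_topk([2.0, 0.1, 1.5, -0.3, 0.9], k=2,
                                   rng=random.Random(0))
    print(indices, [round(w, 3) for w in weights])
```

The noise makes the selection stochastic, so frames with lower scores are still occasionally sampled; lowering `tau` sharpens the soft weights toward the hard choice.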

Abstract

Self-Supervised Video Hashing (SSVH) compresses videos into hash codes for efficient indexing and retrieval using unlabeled training videos. Existing approaches rely on random frame sampling to learn video features and treat all frames equally. This results in suboptimal hash codes, as it ignores frame-specific information density and reconstruction difficulty. To address this limitation, we propose a new framework, termed AutoSSVH, that employs adversarial frame sampling with hash-based contrastive learning. Our adversarial sampling strategy automatically identifies and selects challenging frames with richer information for reconstruction, enhancing encoding capability. Additionally, we introduce a hash component voting strategy and a point-to-set (P2Set) hash-based contrastive objective, which help capture complex inter-video semantic relationships in the Hamming space and improve the discriminability of learned hash codes. Extensive experiments demonstrate that AutoSSVH achieves superior retrieval efficacy and efficiency compared to state-of-the-art approaches. Code is available at https://github.com/EliSpectre/CVPR25-AutoSSVH.
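To make the retrieval setting concrete, the following sketch shows how binary hash codes are commonly compared in Hamming space. It is a generic illustration, not code from the AutoSSVH repository: the helper names (`binarize`, `pack`, `hamming`, `rank_by_hamming`) and the sign-based binarization are assumptions.

```python
def binarize(features):
    """Map real-valued network outputs to bits by sign (>= 0 becomes 1)."""
    return [1 if x >= 0 else 0 for x in features]

def pack(bits):
    """Pack a list of bits into one integer, most significant bit first."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def rank_by_hamming(query, database):
    """Return database indices ordered from nearest to farthest code."""
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))

if __name__ == "__main__":
    query = pack(binarize([0.3, -1.2, 0.8, -0.1]))  # bits 1,0,1,0
    database = [0b0101, 0b1010, 0b1000]
    print(rank_by_hamming(query, database))
```

Because the comparison reduces to an XOR and a bit count, ranking a large database is far cheaper than comparing real-valued features, which is the efficiency argument behind hashing-based retrieval.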

Paper Structure

This paper contains 46 sections, 31 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: (a) Existing methods treat all frames equally and randomly sample frames from the video. (b) In contrast, our approach leverages the Gumbel-Softmax technique to achieve differentiable frame sampling. (c) We propose a GAN-based framework for hash learning, where the frame sampler tries to maximize the learning objectives and the hashing network learns to minimize them. (d) We further derive hash anchors via a component voting strategy, which supplements global semantic information and enhances hash learning.
  • Figure 2: Our Proposed AutoSSVH. (a) AutoSSVH Pipeline. AutoSSVH leverages an adversarially-guided automated sampler, which utilizes the Gumbel-Softmax TopK operation and gradient reversal to select frames exhibiting high reconstruction difficulty within a video. This process generates two sequences with reduced informational content, which are subsequently processed by the hashing network to generate hash codes $\bm{b_i}$ and $\bm{b_j}$ for view contrast learning. $\mathcal{L}_\mathsf{VC}$ is then computed based on these hash codes and those from other sequences. The encoder generates hash codes for the entire training set, followed by pseudo-labeling via k-means clustering. Component voting is then applied to determine cluster centers. Point-to-set (P2Set) hash-based learning is performed next, with $\mathcal{L}_\mathsf{P2Set}$ computed accordingly. Finally, the video is reconstructed, and the frame reconstruction loss $\mathcal{L}_\mathsf{FR}$ is evaluated. (b) The Framework of Grade-Net. (c) Point-to-set Hash-based Learning. ($\alpha$) Anchor Generation via Component Voting ($\beta$) High-level Semantic Learning in Hashing.
  • Figure 3: Comparison of retrieval performance using mAP@N on ActivityNet, FCVID, UCF101 and HMDB51.
  • Figure 4: Retrieval PR curves of different models on UCF101.
  • Figure 5: The impact of the automated adversarial sampling strategy and point-to-set (P2Set) hash-based learning on the retrieval efficiency of AutoSSVH.
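
The anchor generation step described in the Figure 2 caption (pseudo-labels from k-means, then component voting to determine cluster centers) could be sketched as follows. This is one plausible reading, assuming that "component voting" means a per-bit majority vote over the codes in a cluster. The exact rule, the tie-breaking toward +1, and the names `vote_center` and `cluster_centers` are assumptions, and the cluster labels are taken as given rather than computed.

```python
def vote_center(codes):
    """Per-bit majority vote over a cluster's +1/-1 hash codes.

    Ties are broken toward +1 (an arbitrary choice for this sketch).
    """
    n_bits = len(codes[0])
    center = []
    for j in range(n_bits):
        total = sum(code[j] for code in codes)
        center.append(1 if total >= 0 else -1)
    return center

def cluster_centers(codes, labels):
    """Group codes by pseudo-label and vote one binary center per cluster."""
    groups = {}
    for code, label in zip(codes, labels):
        groups.setdefault(label, []).append(code)
    return {label: vote_center(members) for label, members in groups.items()}

if __name__ == "__main__":
    codes = [[1, -1, 1, 1], [1, 1, -1, 1], [1, -1, -1, -1], [-1, -1, 1, 1]]
    labels = [0, 0, 0, 1]  # pseudo-labels, e.g. from k-means (assumed given)
    print(cluster_centers(codes, labels))
```

Voting per bit keeps each center a valid binary code, so it can be compared with individual hash codes directly in Hamming space, unlike a real-valued mean.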