AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing

Niu Lian; Jun Li; Jinpeng Wang; Ruisheng Luo; Yaowei Wang; Shu-Tao Xia; Bin Chen

TL;DR

AutoSSVH tackles efficient unsupervised video retrieval by learning compact $K$-bit hash codes through adversarially guided frame sampling that prioritizes informative frames. A differentiable Grade-Net with Gumbel-Softmax TopK sampling selects hard frames, while a Transformer-based hashing network encodes them into hash codes; a Gradient Reversal Layer enables a single-stage adversarial training regime. The method introduces a Component Voting-based hash center and a Point-to-set (P2Set) hash contrastive objective to capture global semantics and neighborhood structure, reinforced by a Frame Reconstruction loss and a View Contrastive loss. Across ActivityNet, FCVID, UCF101, and HMDB51, AutoSSVH achieves state-of-the-art MAP and GMAP at multiple bit lengths, with notable improvements in cross-dataset generalization and training efficiency, demonstrating the practical impact of adversarial hard-frame sampling for video hashing.
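The Gumbel-Softmax TopK step mentioned above can be illustrated with a small, framework-free sketch. It is not the authors' Grade-Net: the function name `gumbel_softmax_topk`, the temperature `tau`, and the toy scores are assumptions made for illustration, and plain Python carries no gradients, so only the sampling logic is shown (perturb per-frame scores with Gumbel noise, take a tempered softmax, keep the top-k indices).

```python
import math
import random


def gumbel_softmax_topk(scores, k, tau=1.0, rng=None):
    """Return (sorted top-k frame indices, relaxed softmax weights).

    Hypothetical helper: `scores` stands in for per-frame scores that a
    scoring network would produce; it is not taken from the paper's code.
    """
    rng = rng or random.Random()
    perturbed = []
    for s in scores:
        # Clamp the uniform sample so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((s + gumbel) / tau)

    # Numerically stable softmax over the perturbed, tempered scores.
    peak = max(perturbed)
    exps = [math.exp(p - peak) for p in perturbed]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Hard selection: indices of the k largest perturbed scores.
    ranked = sorted(range(len(scores)), key=lambda i: perturbed[i], reverse=True)
    return sorted(ranked[:k]), weights


# Toy scores (assumed): frames 0 and 3 are far "harder" than the rest.
frame_scores = [100.0, 0.0, 0.0, 100.0, 0.0]
selected, weights = gumbel_softmax_topk(frame_scores, k=2, rng=random.Random(0))
```

In a differentiable setting the relaxed weights would typically carry gradients back to the scoring network while the hard indices decide which frames are used; that coupling is left out of this sketch.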

Abstract

Self-Supervised Video Hashing (SSVH) compresses videos into hash codes for efficient indexing and retrieval using unlabeled training videos. Existing approaches rely on random frame sampling to learn video features and treat all frames equally. This results in suboptimal hash codes, as it ignores frame-specific information density and reconstruction difficulty. To address this limitation, we propose a new framework, termed AutoSSVH, that employs adversarial frame sampling with hash-based contrastive learning. Our adversarial sampling strategy automatically identifies and selects challenging frames with richer information for reconstruction, enhancing encoding capability. Additionally, we introduce a hash component voting strategy and a point-to-set (P2Set) hash-based contrastive objective, which help capture complex inter-video semantic relationships in the Hamming space and improve the discriminability of learned hash codes. Extensive experiments demonstrate that AutoSSVH achieves superior retrieval efficacy and efficiency compared to state-of-the-art approaches. Code is available at https://github.com/EliSpectre/CVPR25-AutoSSVH.
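To make the idea of retrieval in the Hamming space concrete, here is a minimal sketch of how binary hash codes can be compared and ranked. It assumes codes are obtained by taking the sign of real-valued features, which is a common convention rather than something the abstract specifies; the helper names `binarize`, `hamming`, and `rank_by_hamming` and the toy feature vectors are hypothetical.

```python
def binarize(features):
    """Map real-valued features to a +1/-1 code by sign (assumed convention)."""
    return [1 if x >= 0 else -1 for x in features]


def hamming(a, b):
    """Count the positions where two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same number of bits")
    return sum(1 for x, y in zip(a, b) if x != y)


def rank_by_hamming(query_code, database_codes):
    """Return database indices ordered from nearest to farthest."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )


# Toy 4-bit example with made-up feature values.
query = binarize([0.9, -0.2, 0.4, -0.7])
database = [
    binarize([-0.5, 0.3, -0.1, 0.8]),
    binarize([0.6, -0.4, 0.2, 0.1]),
    binarize([0.7, -0.9, 0.3, -0.2]),
]
ranking = rank_by_hamming(query, database)
```

For codes in {-1, +1}^K, the Hamming distance equals (K - a·b) / 2, which is why inner products are often used as a surrogate for Hamming distance when training hashing models.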
