Efficient Self-Supervised Video Hashing with Selective State Spaces

Jinpeng Wang, Niu Lian, Jun Li, Yuting Wang, Yan Feng, Bin Chen, Yongbing Zhang, Shu-Tao Xia

TL;DR

The paper addresses the efficiency gap in self-supervised video hashing (SSVH) by introducing S5VH, a Mamba-based model that uses bidirectional state-space layers to capture long-range temporal dependencies with linear scalability. It combines a hash layer for compact representations with a self-local-global (SLG) learning paradigm and a semantic hash center-generation mechanism to exploit global semantic structure, achieving faster and more stable convergence. The key contributions include (i) the first integration of Mamba into SSVH with bidirectional processing, (ii) a hash center generation algorithm that yields semantically consistent, well-separated centers, and (iii) a center-alignment loss that provides a global learning signal, enhancing learning efficiency. Empirically, S5VH outperforms state-of-the-art baselines on four benchmarks, transfers better across datasets, and demonstrates notably improved inference efficiency, making it well-suited for large-scale video retrieval systems with long sequences.
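
To make the "bidirectional" and "linear scalability" claims concrete, here is a toy sketch of a data-dependent recurrence scanned in both directions. It is not Mamba's selective scan and not the authors' layer: the scalar state, the sigmoid gate, the parameters `w` and `b`, and the choice to sum the two directions are all assumptions made for illustration. The only point is that each direction visits every frame once, so cost grows linearly with the number of frames.

```python
import math

def gated_scan(xs, w=1.0, b=0.0):
    """One left-to-right pass of h_t = a_t * h_{t-1} + (1 - a_t) * x_t.

    The gate a_t depends on the current input x_t (hypothetical parameters w, b),
    which is what makes this toy recurrence data-dependent.
    """
    h = 0.0
    out = []
    for x in xs:
        a = 1.0 / (1.0 + math.exp(-(w * x + b)))  # gate in (0, 1)
        h = a * h + (1.0 - a) * x
        out.append(h)
    return out

def bidirectional_scan(xs, w=1.0, b=0.0):
    """Run the scan forward and backward, then sum the two per-step states (assumed fusion)."""
    fwd = gated_scan(xs, w, b)
    bwd = gated_scan(xs[::-1], w, b)[::-1]  # scan the reversed sequence, then realign
    return [f + g for f, g in zip(fwd, bwd)]

# Each output position now mixes information from frames before and after it.
states = bidirectional_scan([0.5, -1.0, 2.0])
```

Doubling the sequence length doubles the work in each pass, in contrast to the quadratic pairwise interactions of self-attention.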

Abstract

Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal. Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency. Code is available at https://github.com/gimpong/AAAI25-S5VH.
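
The abstract does not spell out the center alignment loss, so the following is only a minimal sketch under stated assumptions: soft hash vectors are relaxed to (-1, 1) with `tanh`, hash centers are vectors in {-1, +1}, and alignment is measured by bitwise binary cross-entropy, a form commonly used in center-based hashing. The function names are hypothetical and do not come from the released code.

```python
import math

def soft_hash(logits):
    """Relax binary codes to (-1, 1) with tanh so the mapping stays differentiable."""
    return [math.tanh(x) for x in logits]

def binarize(soft_code):
    """Inference-time binarization: the sign of each relaxed bit, as +1 or -1."""
    return [1 if v >= 0 else -1 for v in soft_code]

def center_alignment_loss(soft_code, center, eps=1e-7):
    """Assumed form: mean binary cross-entropy between relaxed bits and a +/-1 center."""
    total = 0.0
    for v, c in zip(soft_code, center):
        p = min(max((v + 1.0) / 2.0, eps), 1.0 - eps)  # map (-1, 1) to (0, 1)
        t = (c + 1) / 2                                # map {-1, +1} to {0, 1}
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(center)

# A code that agrees with its assigned center incurs a small loss;
# the same code scored against the opposite center incurs a large one.
code = soft_hash([3.0, -3.0, 3.0, -3.0])
aligned = center_alignment_loss(code, [1, -1, 1, -1])
opposed = center_alignment_loss(code, [-1, 1, -1, 1])
```

Because every video is pulled toward one of a fixed set of well-separated centers, a loss of this kind supplies a signal about the global structure of the hash space rather than only about pairs of views.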

Paper Structure

This paper contains 37 sections, 21 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Highlights: (a) Our S5VH based on Mamba exhibits lower inference overheads on memory and computation. The efficiency advantages are scalable and more notable under larger frame numbers. (b) The introduced global learning signal in the hash space effectively enhances training efficiency, showing faster and better convergence.
  • Figure 2: Overview of S5VH (best viewed in color). (a) The encoder and decoder comprise bidirectional Mamba layers for effective and efficient temporal modeling. (b) We propose an optimization algorithm to transform the feature-space global structure into well-separated and semantically consistent hash centers. (c) We encode video frames into features and assign the video a pseudo-label given by its nearest feature cluster. Then, we sample two views of the video and process them with the shared encoder and hash layer, obtaining frame-wise soft hash vectors. Next, we aggregate the frame hash vectors into video-level hash vectors for contrastive learning and center alignment. Meanwhile, we employ an auxiliary decoder (removed at inference) to reconstruct the masked frames from the frame hash vectors of each view.
  • Figure 3: Retrieval performance comparison by mAP@$N$.
  • Figure 4: Retrieval PR curves of different models on the UCF101 and HMDB51 datasets.
  • Figure 5: The t-SNE visualization of the learned hash codes on UCF101. Data points of the same color correspond to the same category. Only the first 10 classes are visualized.
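
The mAP@$N$ metric in Figure 3 is not defined in this summary. As a rough sketch of how such a score is typically computed for hash codes, the code below ranks database items by Hamming distance to the query and averages precision over the relevant items found in the top $N$. The normalization (dividing by the number of hits within the top $N$) is one common convention and is an assumption here; the paper may use a different one. All function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions where two equal-length binary codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def rank_by_hamming(query, database):
    """Database indices sorted by ascending Hamming distance (ties keep index order)."""
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))

def average_precision_at_n(ranked_relevance, n):
    """AP@N: mean of precision@k over the ranks k <= N that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance[:n], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def map_at_n(query_codes, query_labels, db_codes, db_labels, n):
    """Mean of AP@N over all queries; an item is relevant if its label matches the query's."""
    aps = []
    for code, label in zip(query_codes, query_labels):
        order = rank_by_hamming(code, db_codes)
        aps.append(average_precision_at_n([db_labels[i] == label for i in order], n))
    return sum(aps) / len(aps)
```

For example, a ranked list whose first and third items are relevant gives AP@3 = (1/1 + 2/3) / 2.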