Efficient Self-Supervised Video Hashing with Selective State Spaces

Jinpeng Wang; Niu Lian; Jun Li; Yuting Wang; Yan Feng; Bin Chen; Yongbing Zhang; Shu-Tao Xia

TL;DR

The paper addresses the computational and memory cost of Transformer-based self-supervised video hashing (SSVH) by introducing S5VH, a Mamba-based model whose bidirectional state-space layers capture long-range temporal dependencies with linear complexity. It combines a hash layer for compact binary representations with a self-local-global (SLG) learning paradigm and a semantic hash center generation mechanism that exploits global semantic structure, leading to faster and better convergence. The key contributions are (i) the first integration of Mamba into SSVH, using bidirectional processing, (ii) a hash center generation algorithm that yields semantically consistent, well-separated centers, and (iii) a center alignment loss that provides a global learning signal and improves learning efficiency. Empirically, S5VH outperforms state-of-the-art baselines on four benchmarks, transfers better across datasets, and is more efficient at inference, which makes it well suited to large-scale video retrieval over long sequences.
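To make the hashing terminology above concrete, here is a minimal NumPy sketch of binary codes, Hamming-distance retrieval, and one plausible form of a center alignment objective. It is an illustration under stated assumptions, not the S5VH implementation: the function names, the random projection standing in for a learned hash layer, and the choice of binary cross-entropy as the alignment loss are all hypothetical.

```python
import numpy as np

def hash_codes(features, projection):
    """Binarize real-valued features into {-1, +1} codes (assumed: linear layer + sign)."""
    logits = features @ projection
    return np.where(logits >= 0, 1, -1)

def hamming_distance(codes_a, codes_b):
    """Pairwise Hamming distance for {-1, +1} codes, using d = (K - <a, b>) / 2."""
    k = codes_a.shape[1]
    return (k - codes_a @ codes_b.T) // 2

def center_alignment_loss(relaxed_codes, centers, assignments):
    """Assumed loss: binary cross-entropy between relaxed codes in (-1, 1)
    and the {-1, +1} hash center assigned to each sample."""
    eps = 1e-7
    p = np.clip((relaxed_codes + 1.0) / 2.0, eps, 1.0 - eps)  # map to (0, 1)
    t = (centers[assignments] + 1.0) / 2.0                    # map to {0, 1}
    return float(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 8))   # 4 hypothetical video features
    proj = rng.normal(size=(8, 16))   # stand-in for a learned hash layer
    codes = hash_codes(feats, proj)
    print(hamming_distance(codes, codes))
```

Because the codes are binary, retrieval reduces to comparing Hamming distances, which is what makes hashing attractive for large-scale indexing. The alignment loss gets smaller as each sample's relaxed code moves toward its assigned center.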

Abstract

Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal. Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency. Code is available at https://github.com/gimpong/AAAI25-S5VH.
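The linear-complexity, bidirectional processing described in the abstract can be illustrated with a toy recurrence. The sketch below is deliberately much simpler than Mamba's selective scan: it uses a hypothetical sigmoid gate computed from the input and a plain gated running average, then sums a forward and a backward pass. All names and the gate parameterization are assumptions made for illustration only.

```python
import numpy as np

def gated_scan(x, gate):
    """One left-to-right pass: h_t = gate_t * h_{t-1} + (1 - gate_t) * x_t.
    Cost is linear in the number of frames T."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = gate[t] * h + (1.0 - gate[t]) * x[t]
        out[t] = h
    return out

def bidirectional_scan(x, w_gate):
    """Sum of a forward and a backward gated scan over frame features x of shape (T, D).
    The gate depends on the input at each step (a toy stand-in for data-dependent selection)."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))     # values in (0, 1), shape (T, D)
    forward = gated_scan(x, gate)
    backward = gated_scan(x[::-1], gate[::-1])[::-1]
    return forward + backward

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(6, 4))               # 6 hypothetical frame features
    w = rng.normal(size=(4, 4))
    print(bidirectional_scan(frames, w).shape)
```

Each pass touches every frame once, so the cost grows linearly with sequence length, in contrast to the quadratic cost of full self-attention. Running the scan in both directions gives every frame context from both earlier and later frames.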
