Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval

Chen Jiang; Kaiming Huang; Sifeng He; Xudong Yang; Wei Zhang; Xiaobo Zhang; Yuan Cheng; Lei Yang; Qing Wang; Furong Xu; Tan Pan; Wei Chu

TL;DR

This work tackles large-scale, segment-level Content-Based Video Retrieval (CBVR) by introducing the Segment Similarity and Alignment Network (SSAN), a framework that jointly learns Self-supervised Keyframe Extraction (SKE) and Similarity Pattern Detection (SPD) within an end-to-end architecture. By combining offline indexing with an index-based online inference strategy, SSAN achieves high temporal alignment accuracy while reducing storage and online query costs. The key contributions are (i) SKE for efficient, high-quality keyframe selection, (ii) SPD for robust, pattern-based segment alignment on frame-to-frame similarity maps, (iii) end-to-end multi-task learning that propagates alignment signals back to keyframe selection, and (iv) scalable retrieval pipelines suitable for large-scale datasets. Experimental results on VCDB, CC_WEB, FIVR-200K, and VCDB_plus show competitive or superior alignment and retrieval performance with substantially lower computational and storage demands, which makes the approach practical for copyright protection and video search at scale.

Abstract

With the explosive growth of web videos in recent years, large-scale Content-Based Video Retrieval (CBVR) has become increasingly essential for video filtering, recommendation, and copyright protection. Segment-level CBVR (S-CBVR) locates the start and end times of similar segments at a finer granularity, which benefits user browsing efficiency and infringement detection, especially for long videos. The challenge of the S-CBVR task is to achieve high temporal alignment accuracy with efficient computation and low storage consumption. In this paper, we propose a Segment Similarity and Alignment Network (SSAN) to address this challenge, which is the first to be trained end-to-end for S-CBVR. SSAN is built on two newly proposed modules for video retrieval: (1) an efficient Self-supervised Keyframe Extraction (SKE) module that reduces redundant frame features, and (2) a robust Similarity Pattern Detection (SPD) module for temporal alignment. Compared with uniform frame extraction, SKE saves feature storage and search time while achieving comparable accuracy at limited extra computation time. For temporal alignment, SPD localizes similar segments with higher accuracy and efficiency than existing deep learning methods. Furthermore, we jointly train SKE and SPD within SSAN and achieve an end-to-end improvement. The two key modules can also be inserted into other video retrieval pipelines, where they yield considerable performance improvements. Experimental results on public datasets show that SSAN obtains higher alignment accuracy than existing methods while reducing storage and online query computational cost.
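To make the notion of a frame-to-frame similarity map concrete, the following is a minimal sketch, not the paper's implementation. It assumes per-frame feature vectors and cosine similarity; the function name `similarity_map`, the feature dimension, and the synthetic data are all hypothetical, and the paper's own similarity definition is not reproduced here.

```python
import numpy as np

def similarity_map(query_feats, ref_feats):
    """Cosine similarity between every query frame and every reference frame.

    Assumption: each row is one frame's feature vector.
    """
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    return q @ r.T  # shape: (num_query_frames, num_ref_frames)

# Synthetic stand-in for real frame features (hypothetical data).
rng = np.random.default_rng(0)
ref = rng.normal(size=(40, 128))

# Hypothetical query: 10 unrelated frames, then a copy of reference frames 5..24.
query = np.vstack([rng.normal(size=(10, 128)), ref[5:25]])

sim = similarity_map(query, ref)

# For each query frame, the index of its most similar reference frame.
best = sim.argmax(axis=1)
```

In this toy setup the copied segment appears as a diagonal streak of high values in `sim` (query rows 10 to 29 against reference columns 5 to 24). A streak of this kind is the sort of pattern that a detection module operating on the map could localize to recover start and end times.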

Paper Structure (18 sections, 7 equations, 7 figures, 4 tables)

Introduction
Related Work
Feature Representation and Indexing
Keyframe Extraction
Temporal Alignment
Segment Similarity and Alignment
Self-supervised Keyframe Extraction
Temporal Alignment based on Similarity Pattern Detection
End-to-End and Multi-Task Joint Learning
Extend to Large-Scale Retrieval
Experiments
Datasets and Evaluation Metrics
Implementation Details
Experimental Results
Temporal Alignment
...and 3 more sections

Figures (7)

Figure 1: Query process of our proposed approach on Segment-level Content Based Video Retrieval (S-CBVR)
Figure 2: Self-supervised Keyframe Extraction (SKE) module
Figure 3: Similarity Pattern Detection (SPD) module
Figure 4: The training process of SSAN. The bottom similarity map is pre-computed from the features of a video pair (as given in Eq. (3)), which differs from the index search step in the query process of SSAN shown in Figure 1.
Figure 5: Detailed pipeline of our index-based segment-level video search. The frame features of gallery videos are extracted and indexed offline (marked with dashed arrows).
...and 2 more figures
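
The split between offline indexing and online querying mentioned in the Figure 5 caption can be illustrated with a small sketch. This is an assumption-laden toy, not the authors' pipeline: a brute-force NumPy matrix stands in for whatever index the paper uses, and `build_index`, `query_index`, `top_k`, and the video ids are hypothetical names.

```python
import numpy as np
from collections import defaultdict

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def build_index(gallery):
    """Offline step: stack all gallery frame features and record their origin."""
    feats, owners = [], []
    for video_id, frames in gallery.items():
        feats.append(l2_normalize(frames))
        owners.extend((video_id, t) for t in range(len(frames)))
    return np.vstack(feats), owners

def query_index(index, owners, query_frames, top_k=3):
    """Online step: look up each query frame, then group hits by gallery video."""
    scores = l2_normalize(query_frames) @ index.T
    hits = defaultdict(list)
    for q_t, row in enumerate(scores):
        # Indices of the top_k most similar gallery frames for this query frame.
        for j in np.argsort(row)[::-1][:top_k]:
            video_id, g_t = owners[j]
            hits[video_id].append((q_t, g_t, float(row[j])))
    return hits

# Hypothetical gallery of two videos with random frame features.
rng = np.random.default_rng(1)
gallery = {"vid_a": rng.normal(size=(30, 64)), "vid_b": rng.normal(size=(20, 64))}

# Hypothetical query: frames 4..11 copied from "vid_b".
query = gallery["vid_b"][4:12]

index, owners = build_index(gallery)
hits = query_index(index, owners, query)
```

Each entry in `hits[video_id]` is a `(query_frame, gallery_frame, score)` triple. Grouping hits per gallery video yields a sparse set of frame matches for each candidate, on which a temporal alignment step could then operate without comparing the query against every gallery video in full.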
