Queries Are Not Alone: Clustering Text Embeddings for Video Search

Peyang Liu, Xi Wang, Ziqiang Cui, Wei Ye

TL;DR

The paper tackles the semantic gap in video retrieval by introducing Video-Text Cluster (VTC), a framework that expands the semantic field of textual queries through clustering and mitigates noise with a Sweeper module. It adds a Video-Text Cluster-Attention (VTC-Att) mechanism that aligns cluster-level text semantics with video frames, built on a CLIP-based multimodal Transformer and trained with a joint objective that combines Cross-Batch Negative Sampling and InfoNCE losses. Key contributions are the Text Clusterer with a dropout-contrastive objective, the Sweeper for noise identification and semantic signaling, and the VTC-Att mechanism that fuses text and video signals for robust retrieval. Empirical results on five datasets (MSRVTT, LSMDC, DiDeMo, VATEX, Charades) show state-of-the-art performance and indicate that the method handles diverse queries and complex video content, offering a scalable approach to video search in multimodal systems.
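To make the InfoNCE part of the training objective concrete, here is a minimal NumPy sketch of a symmetric InfoNCE loss over a batch of matched text and video embeddings. It is an illustration under assumptions, not the authors' implementation: the function name `info_nce`, the temperature value, the symmetric (text-to-video plus video-to-text) averaging, and the use of cosine similarity are all hypothetical choices, and the Cross-Batch Negative Sampling term is not covered.

```python
import numpy as np

def info_nce(text_emb, video_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of matched (text, video) pairs.

    Row i of text_emb is assumed to be the positive for row i of video_emb;
    every other row in the batch acts as a negative. The temperature and the
    symmetric averaging are illustrative assumptions.
    """
    # L2-normalise so the dot product is a cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # shape (B, B); diagonal holds positives

    def cross_entropy_diag(z):
        # Numerically stable log-softmax over each row, then pick the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the text-to-video and video-to-text directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

# Toy check: aligned pairs should score a lower loss than shuffled pairs.
emb = np.eye(4)
loss_matched = info_nce(emb, emb)
loss_shuffled = info_nce(emb, np.roll(emb, 1, axis=0))
```

In this toy setup the loss is low when each text sits closest to its own video and grows when the pairing is shuffled, which is the behaviour a contrastive retrieval objective is meant to reward.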

Abstract

The rapid proliferation of video content across various platforms has highlighted the urgent need for advanced video retrieval systems. Traditional methods, which primarily depend on directly matching textual queries with video metadata, often fail to bridge the semantic gap between text descriptions and the multifaceted nature of video content. This paper introduces a novel framework, the Video-Text Cluster (VTC), which enhances video retrieval by clustering text queries to capture a broader semantic scope. We propose a unique clustering mechanism that groups related queries, enabling our system to consider multiple interpretations and nuances of each query. This clustering is further refined by our innovative Sweeper module, which identifies and mitigates noise within these clusters. Additionally, we introduce the Video-Text Cluster-Attention (VTC-Att) mechanism, which dynamically adjusts focus within the clusters based on the video content, ensuring that the retrieval process emphasizes the most relevant textual features. Further experiments have demonstrated that our proposed model surpasses existing state-of-the-art models on five public datasets.

Paper Structure

This paper contains 25 sections, 22 equations, 3 figures, 8 tables, 2 algorithms.

Figures (3)

  • Figure 1: A text-video pair in the MSRVTT dataset. A brief, solitary query struggles to adequately capture the complex semantics of the video.
  • Figure 2: Overview of our proposed method. The Text Clusterer groups all texts into distinct clusters. Subsequently, the Text Encoder and Video Encoder encode these clustered texts and video frames. The Sweeper identifies noise within the clustered texts and generates semantic signals. The VTC-Att mechanism then combines these semantic signals with video frame signals to eliminate noise. Ultimately, the refined text cluster embeddings are employed for retrieval.
  • Figure 3: Visualization of $\sigma({\bf QK})$ for an instance from the MSRVTT dataset, depicting the attention scores of each video frame with respect to the texts within the text cluster.
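
The quantity visualized in Figure 3 can be illustrated with a small NumPy sketch. This is a rough reading of the caption under stated assumptions, not the paper's VTC-Att code: it takes $\sigma$ to be a row-wise softmax, uses the frame embeddings as queries and the cluster's text embeddings as keys, and adds the usual $1/\sqrt{d}$ scaling. The function name `cluster_attention_scores` and the random inputs are hypothetical.

```python
import numpy as np

def cluster_attention_scores(frame_emb, text_emb):
    """Softmax attention of each video frame over the texts in one cluster.

    frame_emb: (num_frames, d) array, used as queries (assumption).
    text_emb:  (num_texts, d) array, used as keys (assumption).
    Returns a (num_frames, num_texts) matrix whose rows sum to 1.
    """
    d = frame_emb.shape[1]
    # Scaled dot-product scores; the 1/sqrt(d) factor is an assumption.
    scores = frame_emb @ text_emb.T / np.sqrt(d)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Random stand-ins for 12 frame embeddings and a cluster of 5 text embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 64))
texts = rng.normal(size=(5, 64))
attn = cluster_attention_scores(frames, texts)
```

Each row of `attn` is one frame's distribution over the cluster's texts, so plotting it as a heatmap gives a frames-by-texts picture of the kind the caption describes.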