Mitigating Semantic Collapse in Partially Relevant Video Retrieval

WonJun Moon, MinSeok Jung, Gilhan Park, Tae-Young Kim, Cheol-Ho Cho, Woojin Jun, Jae-Pil Heo

TL;DR

This work tackles semantic collapse in partially relevant video retrieval by introducing Text Correlation Preservation Learning (TCPL) to regularize text embeddings via the semantic structure of CLIP, and Cross-Branch Video Alignment (CBVA) to disentangle multi-context video representations across temporal scales. It further employs Order-Preserving Token Merging (OP-ToMe) and adaptive CBVA to construct coherent yet distinctive video clips, enabling fine-grained frame-clip alignment. Across four PRVR benchmarks, the approach achieves state-of-the-art performance, notably improving SumR on QVHighlights by up to 8 points and demonstrating robust gains on additional datasets. The framework advances PRVR by mitigating semantic collapse in both text and video spaces, with practical impact on search efficiency and retrieval accuracy, while acknowledging CLIP-dependent limitations and training costs as caveats.

Abstract

Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses these problems, which we term semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
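
As a rough illustration of the idea behind Text Correlation Preservation Learning, the sketch below compares the pairwise similarity structure of learned query embeddings against that of a frozen reference model. It is a minimal sketch under stated assumptions, not the loss used in the paper: the mean-squared-error form, the cosine similarity measure, and the function names `cosine_similarity_matrix` and `correlation_preservation_loss` are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarities between the rows of x (shape: n x d)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T


def correlation_preservation_loss(learned, reference):
    """Mean squared gap between two query-to-query similarity matrices.

    learned:   (n, d1) query embeddings from the retrieval model.
    reference: (n, d2) embeddings of the same queries from a frozen model
               (for example CLIP's text encoder). Dimensions may differ,
               since only the n x n similarity structure is compared.
    The MSE form is an assumption made for this sketch.
    """
    s_learned = cosine_similarity_matrix(learned)
    s_reference = cosine_similarity_matrix(reference)
    return float(np.mean((s_learned - s_reference) ** 2))


# Three mutually unrelated queries according to the reference model.
reference = np.eye(3)
# A collapsed text space: every query maps to the same direction.
collapsed = np.ones((3, 4))

print(correlation_preservation_loss(reference, reference))
print(correlation_preservation_loss(collapsed, reference))
```

In this toy case the collapsed embeddings are penalized because their similarity matrix is all ones while the reference says the queries are unrelated; embeddings that reproduce the reference structure incur no penalty.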

Paper Structure

This paper contains 16 sections, 10 equations, 3 figures, 11 tables, 3 algorithms.

Figures (3)

  • Figure 1: Illustration of semantic collapse. (Top) Untrimmed videos in PRVR encompass diverse semantics that can be described by different texts. As a result, semantic segments (both text and video clips) from the same video may convey very different meanings, while segments from different videos can nonetheless be closely related. For example, Q2 of Video A and Q1 of Video B both depict “holding a dog”. (Bottom) Since all queries tied to a given video are treated as positives and negative queries are drawn from other videos, the model pulls together all text embeddings (and their corresponding video segments) for that video, regardless of true meaning, and pushes apart semantically similar queries (and segments) from different videos. (a) illustrates that queries of the same video are pulled together regardless of their semantic relationships (left), while queries with similar context (holding a dog) are pushed apart (right). (b) shows that video segments suffer from the same phenomenon.
  • Figure 2: Method overview. We extract text and visual tokens with pretrained backbones and process them with transformer layers. Text tokens are aggregated via attention pooling to produce a single query token $\bar{T}$ for each text query. Following prior works, visual tokens are encoded in two branches (frame-level and clip-level), producing a sequence $\bar{V}$ of video tokens for each level. A baseline retrieval loss $\mathcal{L}^{\text{base}}$ aligns $\bar{T}$ with the most similar video token at each level. To mitigate text-side semantic collapse, Text Correlation Preservation Learning transfers CLIP's query relationships. On the video side, Cross-Branch Video Alignment aligns hierarchical segments by their timestamps to mitigate collapse and preserve visual details. CBVA is further refined by constructing coherent clips with Order-Preserving Token Merging and by improving adaptivity (illustrated in Sec. \ref{Sec.vr}).
  • Figure 3: Retrieval example. 'GT Video' denotes the ground-truth video paired with the query. $\checkmark$, $\triangle$, and ✗ indicate whether or not the retrieved video token is semantically aligned with the query, regardless of whether it originates from the ground-truth video.
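
The matching step described in the Figure 2 caption, where the query token $\bar{T}$ is compared against the most similar video token at each level, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the paper's implementation: cosine similarity is assumed as the similarity measure, summing the frame-level and clip-level maxima into one score is an assumption of this example, and the names `max_token_similarity` and `video_score` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, guarding against division by zero."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def max_token_similarity(query, tokens):
    """Return the highest cosine similarity between one query vector
    (shape: d) and a sequence of video tokens (shape: n x d), together
    with the index of the best-matching token."""
    sims = l2_normalize(tokens) @ l2_normalize(query)
    best = int(np.argmax(sims))
    return float(sims[best]), best


def video_score(query, frame_tokens, clip_tokens):
    """Combine the best frame-level and clip-level matches.

    Adding the two maxima is an assumption made for this sketch.
    """
    frame_sim, _ = max_token_similarity(query, frame_tokens)
    clip_sim, _ = max_token_similarity(query, clip_tokens)
    return frame_sim + clip_sim


query = np.array([1.0, 0.0])
frame_tokens = np.array([[0.0, 1.0], [2.0, 0.0]])
clip_tokens = np.array([[1.0, 1.0], [0.0, 3.0]])

print(max_token_similarity(query, frame_tokens))
print(video_score(query, frame_tokens, clip_tokens))
```

Because only the single best token at each level contributes to the score, a video can rank highly even when most of its content is unrelated to the query, which is the partially relevant setting the abstract describes.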