Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval

WonJun Moon, Cheol-Ho Cho, Woojin Jun, Minho Shim, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Jae-Pil Heo

TL;DR

The paper tackles the challenge of achieving high retrieval accuracy for partially relevant videos while maintaining efficiency. It introduces a prototypical PRVR framework that encodes diverse video contexts into a fixed number of prototypes $\hat{\mathbf{P}}$ and aligns them with textual queries via a cross-modal reconstruction task and a uni-modal reconstruction task to preserve context. A weak guidance mechanism and an orthogonal prototype diversification loss $\mathcal{L}_{ortho}$ further refine prototype focus and diversity. Empirically, the approach sets new state-of-the-art results on TVR, ActivityNet Captions, and QVHighlights with substantial memory savings and competitive inference speed. These contributions demonstrate a practical balance between accuracy, efficiency, and robustness for large-scale PRVR in real-world video databases.

Abstract

In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
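The orthogonal objective mentioned above can be illustrated with a small sketch. It is not the paper's exact formulation of $\mathcal{L}_{ortho}$; it assumes one common variant, which penalizes the squared cosine similarity between every pair of distinct prototypes so that the penalty is zero only when the prototypes are mutually orthogonal. The function names `l2_normalize` and `orthogonal_penalty` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def orthogonal_penalty(prototypes):
    """Mean squared cosine similarity over all ordered pairs of distinct prototypes.

    Assumption: this is an illustrative stand-in for an orthogonality loss,
    not the loss defined in the paper.
    """
    unit = [l2_normalize(p) for p in prototypes]
    k = len(unit)
    if k < 2:
        return 0.0
    total = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:
                cos = sum(a * b for a, b in zip(unit[i], unit[j]))
                total += cos * cos
    return total / (k * (k - 1))


if __name__ == "__main__":
    diverse = [[1.0, 0.0], [0.0, 1.0]]      # orthogonal prototypes
    collapsed = [[1.0, 0.0], [2.0, 0.0]]    # prototypes pointing the same way
    print(orthogonal_penalty(diverse), orthogonal_penalty(collapsed))
```

Under this assumed form, minimizing the penalty pushes prototypes toward encoding different content, which matches the stated goal of capturing a diverse range of video contexts.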

Paper Structure

This paper contains 19 sections, 9 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Comparison of video encoding processes. (a) MS-SL uses exhaustive clip modeling with clip windows of varying lengths to encode contexts at diverse temporal scales. (b) GMMFormer performs similarity-aware feature aggregation via self-attention constrained by predefined Gaussian kernels to reflect locality. Such an adaptive scheme provides high efficiency. (c) To exploit the semantic richness of exhaustive clip modeling without sacrificing efficiency, we learn a fixed number of prototypes that aggregate diverse (varying-length or potentially disjoint) contexts within a video. (Bottom) Our proposed method offers a better accuracy-efficiency trade-off than previous works.
  • Figure 2: An overview of the clip branch in our prototypical framework, which matches the frame branch except that it includes exhaustive clip modeling. For the video stream, prototype aggregation is applied to clip features formed by an exhaustive clip modeling strategy. Text queries are encoded and aggregated through an attention-pooling layer. Finally, similarity matching between the visual prototypes $\hat{\mathbf{P}}^c$ and the query token $\hat{\mathbf{T}}^c$ yields the text-to-video score $S^c$ for retrieval. On the right side, uni- and cross-modal reconstructions are performed with the constructed prototypes during training. Only the retrieved prototype with the maximum similarity is used to reconstruct the masked text words (cross-modal), while all prototypes are used to reconstruct video frames (uni-modal).
  • Figure 3: An architectural overview of reconstruction tasks. (a) For a cross-modal scenario, masked query features are reconstructed with the retrieved video prototype $\hat{p}_{*}$ via a transformer-based decoder. This aligns the retrieved prototype with the corresponding textual features. (b) For a uni-modal scenario, visual prototypes are processed via an MLP-based decoder to reconstruct the frame-wise features to mitigate the visual information loss during aggregation.
  • Figure 4: An example of the prototypes' attention weights on each frame. Each box represents one prototype.
  • Figure 5: Attendance on frames within moments for visual prototypes, ordered by similarity to a given query. The higher the similarity between a prototype and the text query, the higher its attendance on moment frames.
  • ...and 5 more figures
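
The retrieval step described in the Figure 2 caption, where the prototype with the maximum similarity to the query determines the text-to-video score, can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity is assumed as the matching function, only a single branch is shown, and the names `cosine`, `video_score`, and `rank_videos` are hypothetical rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def video_score(prototypes, query):
    """Return (score, index) of the prototype most similar to the query."""
    sims = [cosine(p, query) for p in prototypes]
    best = max(range(len(sims)), key=sims.__getitem__)
    return sims[best], best


def rank_videos(video_prototypes, query):
    """Rank video ids by their best-prototype score, highest first."""
    scored = [(video_score(protos, query)[0], vid)
              for vid, protos in video_prototypes.items()]
    return [vid for _, vid in sorted(scored, key=lambda t: t[0], reverse=True)]


if __name__ == "__main__":
    # Toy database: each video is stored as a fixed number (here 2) of prototypes.
    database = {
        "video_a": [[1.0, 0.0], [0.0, 1.0]],
        "video_b": [[-1.0, 0.0], [0.0, -1.0]],
    }
    query = [1.0, 0.0]
    print(rank_videos(database, query))
```

Because every video is stored as the same fixed number of prototypes, the number of comparisons per query grows with the number of videos rather than with the number of clips per video, which is the efficiency argument made in the abstract.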