Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval
WonJun Moon, Cheol-Ho Cho, Woojin Jun, Minho Shim, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Jae-Pil Heo
TL;DR
The paper tackles the challenge of achieving high accuracy in partially relevant video retrieval (PRVR) while maintaining efficiency. It introduces a prototypical PRVR framework that encodes diverse video contexts into a fixed number of prototypes $\hat{\mathbf{P}}$. A cross-modal reconstruction task aligns the prototypes with textual queries, and a uni-modal reconstruction task preserves video context. A weak guidance mechanism and an orthogonal prototype diversification loss $\mathcal{L}_{ortho}$ further refine prototype focus and diversity. Empirically, the approach sets new state-of-the-art results on TVR, ActivityNet Captions, and QVHighlights with substantial memory savings and competitive inference speed. These contributions demonstrate a practical balance between accuracy, efficiency, and robustness for large-scale PRVR in real-world video databases.
Abstract
In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this trade-off, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
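
As a rough illustration of what an orthogonal objective over prototypes can look like, the sketch below penalizes the squared cosine similarity between every pair of distinct prototypes. This is one common form of such a penalty, not necessarily the paper's definition of $\mathcal{L}_{ortho}$, which the abstract does not spell out. The function names `l2_normalize` and `orthogonality_penalty` and the toy prototype vectors are assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def orthogonality_penalty(prototypes):
    """Mean squared cosine similarity over all pairs of distinct prototypes.

    Hypothetical stand-in for an orthogonal diversification loss:
    the value is 0 when prototypes are mutually orthogonal and
    approaches 1 when they all point in the same direction.
    """
    unit = [l2_normalize(p) for p in prototypes]
    k = len(unit)
    if k < 2:
        return 0.0
    total = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:
                cos = sum(a * b for a, b in zip(unit[i], unit[j]))
                total += cos * cos
    return total / (k * (k - 1))


# Toy prototypes (assumed values): one diverse set, one collapsed set.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
collapsed = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [3.0, 3.0, 0.0]]

print(orthogonality_penalty(orthogonal))  # no overlap between prototypes
print(orthogonality_penalty(collapsed))   # all prototypes share one direction
```

The penalty is zero for the mutually orthogonal set and close to one for the collapsed set. Minimizing a term of this kind therefore pushes a fixed budget of prototypes toward covering different content rather than duplicating one another.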
