Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval
Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu
TL;DR
This work addresses fine-grained semantic alignment in video-text retrieval by introducing UNIFY, a framework that combines global latent representations with explicit lexicon representations. It maps both videos and texts into a shared, pre-defined lexicon space in which each dimension corresponds to a semantic concept, and uses a two-stage semantics grounding approach to activate relevant concept dimensions while suppressing irrelevant ones. A unified learning scheme with structure sharing and self-distillation lets the latent and lexicon branches learn from each other. UNIFY improves Recall@1 over previous methods by 4.8% on MSR-VTT and 8.2% on DiDeMo, while retaining the fast retrieval of a dual-encoder design.
Abstract
In video-text retrieval, most existing methods adopt a dual-encoder architecture for fast retrieval, using two separate encoders to extract global latent representations for videos and texts. However, these methods struggle to capture fine-grained semantic concepts. In this work, we propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics and combines the strengths of latent and lexicon representations for video-text retrieval. Specifically, we map videos and texts into a pre-defined lexicon space, where each dimension corresponds to a semantic concept. A two-stage semantics grounding approach is proposed to activate semantically relevant dimensions and suppress irrelevant ones, so that the learned lexicon representations reflect the fine-grained semantics of videos and texts. Furthermore, to leverage the complementarity between latent and lexicon representations, we propose a unified learning scheme that facilitates mutual learning via structure sharing and self-distillation. Experimental results show that UNIFY substantially outperforms previous video-text retrieval methods, with Recall@1 improvements of 4.8% on MSR-VTT and 8.2% on DiDeMo.
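To make the complementarity of the two representations concrete, the following is a minimal sketch, not the UNIFY implementation. It assumes each text or video has already been encoded into a dense latent vector and a sparse lexicon vector (a mapping from concept to activation weight). The function names, the toy data, and the weighted-sum fusion with the parameter `alpha` are all hypothetical choices made for illustration; the abstract does not specify how the two scores are combined at retrieval time.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense latent vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lexicon_score(text_lex, video_lex):
    """Dot product of two sparse lexicon vectors (concept -> weight).

    Only concepts activated on both sides contribute to the score.
    """
    return sum(w * video_lex.get(concept, 0.0) for concept, w in text_lex.items())

def fused_score(text, video, alpha=0.5):
    """Hypothetical fusion: weighted sum of latent and lexicon similarity."""
    latent = cosine(text["latent"], video["latent"])
    lexicon = lexicon_score(text["lexicon"], video["lexicon"])
    return alpha * latent + (1.0 - alpha) * lexicon

def rank(text, videos, alpha=0.5):
    """Return video names sorted from best to worst match for the text."""
    return sorted(videos, key=lambda name: fused_score(text, videos[name], alpha),
                  reverse=True)

# Toy data (invented): the two videos are nearly indistinguishable in latent
# space, but only one of them activates the concepts named in the query.
query = {"latent": [1.0, 0.0], "lexicon": {"dog": 1.0, "frisbee": 0.8}}
videos = {
    "dog_park": {"latent": [0.90, 0.10],
                 "lexicon": {"dog": 0.9, "frisbee": 0.7, "grass": 0.3}},
    "cat_sofa": {"latent": [0.95, 0.05],
                 "lexicon": {"cat": 0.9, "sofa": 0.6}},
}
```

In this toy setup, `rank(query, videos, alpha=1.0)` uses only the latent score and places `cat_sofa` first, whereas the default `alpha=0.5` lets the lexicon overlap on "dog" and "frisbee" move `dog_park` to the top. This is the kind of fine-grained distinction that lexicon representations are meant to capture.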
