Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval

Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu

TL;DR

This work addresses fine-grained semantic alignment in video-text retrieval by introducing UNIFY, a framework that combines global latent representations with explicit lexicon representations. It maps both videos and texts into a shared, pre-defined lexicon space and uses a two-stage semantics grounding approach to activate relevant concept dimensions while suppressing irrelevant ones. A unified learning scheme with structure sharing and self-distillation lets the latent and lexicon branches improve each other, yielding Recall@1 improvements of 4.8% on MSR-VTT and 8.2% on DiDeMo over previous methods. The approach keeps the fast retrieval of a dual-encoder design while adding fine-grained semantic concepts to coarse global representations.

Abstract

In video-text retrieval, most existing methods adopt the dual-encoder architecture for fast retrieval, which employs two individual encoders to extract global latent representations for videos and texts. However, they face challenges in capturing fine-grained semantic concepts. In this work, we propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics and combines the strengths of latent and lexicon representations for video-text retrieval. Specifically, we map videos and texts into a pre-defined lexicon space, where each dimension corresponds to a semantic concept. A two-stage semantics grounding approach is proposed to activate semantically relevant dimensions and suppress irrelevant dimensions. The learned lexicon representations can thus reflect fine-grained semantics of videos and texts. Furthermore, to leverage the complementarity between latent and lexicon representations, we propose a unified learning scheme to facilitate mutual learning via structure sharing and self-distillation. Experimental results show our UNIFY framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively.
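
The contrast between the two representation types can be illustrated with a small sketch. It is not the authors' implementation: the six-word `LEXICON`, the hand-written vectors, the helper names, and the weighted sum in `combined_score` (with its `alpha` parameter) are all assumptions made for illustration, since the abstract does not say how the two similarities are combined at retrieval time.

```python
import math

# Hypothetical toy lexicon; the paper uses a pre-defined lexicon space
# in which each dimension corresponds to one semantic concept.
LEXICON = ["woman", "dog", "cat", "cup", "park", "run"]


def cosine(u, v):
    """Cosine similarity between two dense latent vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lexicon_score(u, v):
    """Dot product of two non-negative lexicon vectors.

    Only concepts activated in BOTH vectors contribute to the score.
    """
    return sum(a * b for a, b in zip(u, v))


def top_k_concepts(weights, k=2):
    """Return the k most strongly activated concepts (zero weights excluded)."""
    ranked = sorted(zip(LEXICON, weights), key=lambda p: p[1], reverse=True)
    return [word for word, weight in ranked[:k] if weight > 0]


def combined_score(text, video, alpha=0.5):
    """Assumed fusion rule: a weighted sum of latent and lexicon similarity."""
    latent = cosine(text["latent"], video["latent"])
    lexical = lexicon_score(text["lexicon"], video["lexicon"])
    return alpha * latent + (1 - alpha) * lexical


# Made-up representations for the query "a woman walks her dog".
text = {"latent": [0.2, 0.9, -0.1],
        "lexicon": [0.8, 0.7, 0.0, 0.0, 0.1, 0.0]}
# video_a shows a woman and a dog; video_b shows a cat and a cup.
video_a = {"latent": [0.1, 0.8, 0.0],
           "lexicon": [0.9, 0.6, 0.0, 0.0, 0.3, 0.0]}
video_b = {"latent": [0.2, 0.7, -0.2],
           "lexicon": [0.0, 0.0, 0.9, 0.5, 0.0, 0.0]}

# Rank candidate videos by the combined score, best first.
candidates = {"video_a": video_a, "video_b": video_b}
ranking = sorted(candidates,
                 key=lambda name: combined_score(text, candidates[name]),
                 reverse=True)
```

In this toy setup the two videos have similar latent vectors, so the latent similarity alone separates them only weakly, while the lexicon score for `video_b` is exactly zero because it shares no activated concept with the query. That is the kind of fine-grained signal the lexicon branch is meant to contribute.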

Paper Structure (17 sections, 15 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparison of latent and lexicon representations. The dimensions of latent representations have no explicit meanings. In contrast, each dimension of lexicon representations corresponds to a semantic concept, where semantically relevant dimensions are activated (e.g. woman and dog) while semantically irrelevant dimensions are suppressed (e.g. cat and cup).
  • Figure 2: Overview of our proposed UNIFY framework. The whole model consists of two streams for video and text respectively, each including a stem encoder, two representation-specific encoders and two projection heads. For lexicon representation learning, we propose a two-stage semantics grounding approach (Section \ref{subsec:two_stage}). Furthermore, we unify the latent and lexicon representations via structure sharing and self-distillation (Section \ref{subsec:unified_learning}). VTC stands for video-text contrastive learning.
  • Figure 3: Zero-shot results of latent and lexicon representations on UCF101.
  • Figure 4: Retrieval results of two queries using latent and lexicon representations. Each row presents the top-5 ranked videos.
  • Figure 5: Top-10 activated lexicon dimensions of four variants of UNIFY-Lexicon. Words that are semantically irrelevant to the video (or text) are highlighted in red.
  • ...and 1 more figure