Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval

Haowei Liu; Yaya Shi; Haiyang Xu; Chunfeng Yuan; Qinghao Ye; Chenliang Li; Ming Yan; Ji Zhang; Fei Huang; Bing Li; Weiming Hu

TL;DR

This work addresses the challenge of fine-grained semantic alignment in video-text retrieval by introducing UNIFY, a framework that unifies global latent representations with explicit lexicon representations. It maps both videos and texts into a shared lexicon space and employs a two-stage semantics grounding approach to activate relevant concept dimensions while suppressing irrelevant ones. A unified learning scheme with structure sharing and self-distillation enables mutual enhancement between the latent and lexicon branches, achieving state-of-the-art results on MSR-VTT and DiDeMo. The approach keeps the fast retrieval of a dual-encoder design while combining coarse global representations with fine-grained semantic concepts in a single learnable framework.

Abstract

In video-text retrieval, most existing methods adopt the dual-encoder architecture for fast retrieval, which employs two individual encoders to extract global latent representations for videos and texts. However, they face challenges in capturing fine-grained semantic concepts. In this work, we propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics and combines the strengths of latent and lexicon representations for video-text retrieval. Specifically, we map videos and texts into a pre-defined lexicon space, where each dimension corresponds to a semantic concept. A two-stage semantics grounding approach is proposed to activate semantically relevant dimensions and suppress irrelevant dimensions. The learned lexicon representations can thus reflect fine-grained semantics of videos and texts. Furthermore, to leverage the complementarity between latent and lexicon representations, we propose a unified learning scheme to facilitate mutual learning via structure sharing and self-distillation. Experimental results show our UNIFY framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively.
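
To make the contrast between the two representation types concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy four-concept lexicon (the concept names are borrowed from the Figure 1 caption), hand-written vectors in place of encoder outputs, and a simple weighted sum as the way to combine latent and lexicon similarity. The abstract does not specify the fusion rule, so `combined_score` and its `weight` parameter are hypothetical.

```python
import math

# Hypothetical toy lexicon; the concept names come from the Figure 1 caption.
VOCAB = ["woman", "dog", "cat", "cup"]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity, used here for the latent vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_concepts(lexicon_vec, k=2):
    """Return the k most strongly activated concept names."""
    ranked = sorted(zip(VOCAB, lexicon_vec), key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def combined_score(text, video, weight=0.5):
    """Assumed fusion rule: weighted sum of latent cosine and lexicon dot product."""
    latent = cosine(text["latent"], video["latent"])
    lexicon = dot(text["lexicon"], video["lexicon"])
    return weight * latent + (1.0 - weight) * lexicon

# Hand-written stand-ins for encoder outputs (illustrative values only).
query = {"latent": [0.6, 0.8], "lexicon": [0.9, 0.8, 0.0, 0.0]}
videos = {
    "woman_with_dog": {"latent": [0.5, 0.9], "lexicon": [0.8, 0.9, 0.1, 0.0]},
    "cat_near_cup": {"latent": [0.7, 0.7], "lexicon": [0.0, 0.1, 0.9, 0.7]},
}

# Rank candidate videos by the combined score, best match first.
ranking = sorted(
    videos,
    key=lambda name: combined_score(query, videos[name]),
    reverse=True,
)
```

In this toy setup the two latent similarities are nearly identical, so the lexicon term decides the ranking: the query and the first video share activated `woman` and `dog` dimensions, while the second video activates `cat` and `cup`. Calling `top_concepts` on any lexicon vector shows which concepts drive the match, which is the interpretability that the latent dimensions lack.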

Paper Structure (17 sections, 15 equations, 6 figures, 5 tables)

Introduction
Related Work
    Video-Text Retrieval
    Lexicon Representation
Method
    Overview
    Two-stage Semantics Grounding
    Unified Learning of Latent and Lexicon Representations
Experiments
    Experimental Setup
    Main Results
    Complementarity between Latent and Lexicon Representations
    Two-stage Semantics Grounding Ablation
    Unified Learning Scheme Ablation
Conclusion
...and 2 more sections

Figures (6)

Figure 1: Comparison of latent and lexicon representations. The dimensions of latent representations have no explicit meanings. In contrast, each dimension of lexicon representations corresponds to a semantic concept, where semantically relevant dimensions are activated (e.g. woman and dog) while semantically irrelevant dimensions are suppressed (e.g. cat and cup).
Figure 2: Overview of our proposed UNIFY framework. The whole model consists of two streams for video and text respectively, each including a stem encoder, two representation-specific encoders and two projection heads. For lexicon representation learning, we propose a two-stage semantics grounding approach (see the section "Two-stage Semantics Grounding"). Furthermore, we unify the latent and lexicon representations via structure sharing and self-distillation (see the section "Unified Learning of Latent and Lexicon Representations"). VTC stands for video-text contrastive learning.
Figure 3: Zero-shot results of latent and lexicon representations on UCF101.
Figure 4: Retrieval results of two queries using latent and lexicon representations. Each row presents the top-5 ranked videos.
Figure 5: Top-10 activated lexicon dimensions of four variants of UNIFY-Lexicon. Words that are semantically irrelevant to the video (or text) are highlighted in red.
...and 1 more figures
