
Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval

Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiaoyong Wei, Chang Wen Chen, Qing Li

TL;DR

This paper tackles video moment retrieval (VMR) by addressing a limitation of LLMs used as decoders: they generate discrete text and so are poorly suited to fine-grained, frame-level salience prediction. It proposes using LLM encoders to refine inter-concept relations within multimodal embeddings and introduces two plug-in strategies, semantic refinement and pseudo-event regulation, that together form a general, modular VMR framework. The authors demonstrate that LLM encoders can refine relationships across CLIP, BLIP, and T5 embeddings and that pseudo-events provide a principled target for moment localization. The approach achieves strong results across datasets and transfers to multiple existing VMR architectures. Overall, the work advances fine-grained video understanding by integrating pretrained language models as representation refiners and temporal priors, with practical implications for more accurate and robust VMR systems.

Abstract

In this paper, we investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution in video moment retrieval (VMR) models. The motivation behind this study arises from the limitations of using LLMs as decoders for generating discrete textual descriptions, which hinder their direct application to continuous outputs like salience scores and inter-frame embeddings that capture inter-frame relations. To overcome these limitations, we propose utilizing LLM encoders instead of decoders. Through a feasibility study, we demonstrate that LLM encoders effectively refine inter-concept relations in multimodal embeddings, even without being trained on textual embeddings. We also show that the refinement capability of LLM encoders can be transferred to other embeddings, such as BLIP and T5, as long as these embeddings exhibit similar inter-concept similarity patterns to CLIP embeddings. We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module. Through experimental validation, we demonstrate the effectiveness of our proposed methods by achieving state-of-the-art performance in VMR. The source code can be accessed at https://github.com/fletcherjiang/LLMEPET.
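
The idea of placing an LLM encoder inside the fusion module can be pictured with a small, dependency-free sketch. It assumes a common wiring that the abstract does not spell out: a trainable projection into the LLM width, one encoder layer whose weights stay frozen, and a trainable projection back. The class names `FrozenSelfAttention` and `LLMRefinedFusion`, the random weights standing in for pretrained ones, and the single-head attention are all hypothetical simplifications, not the authors' implementation.

```python
import math
import random


def matmul(x, w):
    # x: rows x d_in, w: d_in x d_out (lists or tuples of numbers)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class FrozenSelfAttention:
    """Stand-in for one pretrained LLM encoder layer (assumption: weights are
    loaded once and never updated; here they are random for illustration)."""

    def __init__(self, dim, rng):
        self.dim = dim
        self.wq = rand_matrix(dim, dim, rng)
        self.wk = rand_matrix(dim, dim, rng)
        self.wv = rand_matrix(dim, dim, rng)

    def __call__(self, x):
        q, k, v = matmul(x, self.wq), matmul(x, self.wk), matmul(x, self.wv)
        scores = matmul(q, list(zip(*k)))  # token-to-token scores, T x T
        attn = [softmax([s / math.sqrt(self.dim) for s in row]) for row in scores]
        mixed = matmul(attn, v)
        # Residual connection keeps the original token content.
        return [[a + b for a, b in zip(r, m)] for r, m in zip(x, mixed)]


class LLMRefinedFusion:
    """Hypothetical refiner: project in -> frozen layer -> project out."""

    def __init__(self, model_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.proj_in = rand_matrix(model_dim, llm_dim, rng)    # trainable in practice
        self.frozen_layer = FrozenSelfAttention(llm_dim, rng)  # kept frozen
        self.proj_out = rand_matrix(llm_dim, model_dim, rng)   # trainable in practice

    def __call__(self, fused_tokens):
        hidden = matmul(fused_tokens, self.proj_in)
        hidden = self.frozen_layer(hidden)
        return matmul(hidden, self.proj_out)


# Three fused video-text tokens of width 4, refined through an LLM width of 8.
tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
fusion = LLMRefinedFusion(model_dim=4, llm_dim=8)
refined = fusion(tokens)
```

The refined tokens keep the shape of the input, which is what lets such a block be dropped into an existing fusion module without changing the layers around it.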

Paper Structure (24 sections, 11 equations, 15 figures, 6 tables)

This paper contains 24 sections, 11 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: The proportions of improved and deteriorated triplets after the refinement (a--c), and the inter-concept similarity matrices of the concept embeddings before and after the refinement (d--f): in (a) and (d), CLIP is used as the textual embeddings, while BLIP is used for (b) and (e), and T5 is used for (c) and (f).
  • Figure 2: Proportions of improved and deteriorated triplets as a function of the contribution of non-textual embeddings, controlled by $\alpha$, and of the degree of alignment between the textual and non-textual embeddings, controlled by the distortion probability $p$.
  • Figure 3: The impact of utilizing specific layers from the LLM encoder for relation refinement. Individual layers ($4^{th}$, $8^{th}$, and $32^{nd}$) and combined layers ($14^{th}$ to $17^{th}$) are compared.
  • Figure 4: The proposed general framework for VMR with the proposed prior knowledge integration components.
  • Figure 5: Illustration of the effectiveness of using the LLM as a relation refiner. The predictions with the LLM encoder are better aligned with the ground truth. The model without the refiner focuses more on the visually dominant concepts (e.g., girls, guys), while with the refiner, contextual concepts (e.g., traveling, crossing-street) can be further incorporated.
  • ...and 10 more figures
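
The triplet proportions reported in Figures 1 and 2 can be illustrated with a short sketch. It assumes that a triplet (anchor, positive, negative) counts as improved when the cosine-similarity margin sim(anchor, positive) - sim(anchor, negative) grows after refinement, and as deteriorated when it shrinks. That criterion, the function names, and the toy 2-D vectors are assumptions made for illustration and may differ from the paper's exact protocol.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def margin(embeddings, triplet):
    # How much closer the anchor is to the positive than to the negative.
    anchor, positive, negative = triplet
    return (cosine(embeddings[anchor], embeddings[positive])
            - cosine(embeddings[anchor], embeddings[negative]))


def triplet_proportions(before, after, triplets, eps=1e-9):
    """Return (improved, deteriorated) fractions under the assumed criterion."""
    improved = deteriorated = 0
    for triplet in triplets:
        delta = margin(after, triplet) - margin(before, triplet)
        if delta > eps:
            improved += 1
        elif delta < -eps:
            deteriorated += 1
    return improved / len(triplets), deteriorated / len(triplets)


# Toy concept embeddings before and after a hypothetical refinement step.
before = {"girl": [1.0, 0.0], "woman": [0.6, 0.8], "street": [0.8, 0.6]}
after = {"girl": [1.0, 0.0], "woman": [0.9, 0.1], "street": [0.2, 0.9]}

# The second triplet is deliberately mismatched so that one case deteriorates.
triplets = [("girl", "woman", "street"), ("street", "woman", "girl")]

print(triplet_proportions(before, after, triplets))  # -> (0.5, 0.5)
```

Triplets whose margin is unchanged fall into neither group, so the two fractions need not sum to one.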