Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Ming Li, Wenxin Liang, Yang Li, Sidan Du

TL;DR

This paper proposes Moment-GPT, a tuning-free pipeline for zero-shot video moment retrieval (VMR) built on frozen MLLMs. It substantially outperforms state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.

Abstract

The goal of video moment retrieval (VMR) is to predict temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) rely heavily on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook inherent language bias in the query, which leads to erroneous localization. To address these challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR that uses frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Next, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply Video-ChatGPT and a span scorer to select the most appropriate spans. Our proposed method substantially outperforms state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.

Paper Structure (48 sections, 5 equations, 8 figures, 17 tables)

Figures (8)

  • Figure 1: (a) Illustration of video moment retrieval (VMR). A query containing language bias results in erroneous localization. (b) MLLM-based methods demand complex multi-stage fine-tuning using vast volumes of annotated data. (c) The zero-shot method FVLM-2023 cannot avoid the performance degradation caused by language bias. (d) Our proposed Moment-GPT harnesses the video comprehension capabilities of MLLMs without further fine-tuning. It also utilizes an LLM to reduce bias, enhancing overall accuracy.
  • Figure 2: The overall architecture of Moment-GPT. It first utilizes LLaMA-3 to reduce language bias (Sec. \ref{subsec:query_debiasing}). Next, it constructs candidate spans using MiniGPT-v2, a frame scorer, and a span generator (Sec. \ref{subsec:generate_candidates}). Finally, it selects the most relevant spans using Video-ChatGPT, a span scorer, and NMS (Sec. \ref{subsec:select_spans}).
  • Figure 3: Reducing language bias in the raw query via LLaMA-3. Bold, italics, and colored fonts are used only for presentation and are not employed in the code.
  • Figure 4: (a) Image captioning via MiniGPT-v2. (b) Video captioning via Video-ChatGPT. Frame_N and Span_N are just for demonstration convenience and do not exist in reality.
  • Figure 5: Qualitative results on Charades-STA (top) and ActivityNet-Captions (bottom). We mark all biased and rewritten words in red.
  • ...and 3 more figures
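
The Figure 2 caption names NMS (non-maximum suppression) as the last step of span selection. As a rough illustration of what that step generally does for temporal spans, here is a minimal Python sketch of the standard greedy procedure. It is not taken from the paper: the `(start, end, score)` span layout, the function names, and the IoU threshold of 0.5 are all assumptions made for this example, and the scores stand in for whatever the span scorer would produce.

```python
from typing import List, Tuple

# Hypothetical span layout for this sketch: (start_sec, end_sec, score).
Span = Tuple[float, float, float]


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(spans: List[Span], iou_threshold: float = 0.5) -> List[Span]:
    """Greedy NMS: visit spans by descending score and keep a span only if
    it does not overlap an already-kept span by more than iou_threshold.
    The default threshold is an assumed value, not one from the paper."""
    kept: List[Span] = []
    for span in sorted(spans, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(span, k) <= iou_threshold for k in kept):
            kept.append(span)
    return kept


candidates = [(0.0, 10.0, 0.9), (1.0, 11.0, 0.8), (20.0, 30.0, 0.7)]
selected = temporal_nms(candidates)
```

In this toy input the second candidate overlaps the first with an IoU of 9/11, so it is suppressed, while the third candidate does not overlap either and is kept.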