GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features

Yunzhuo Sun, Yifang Xu, Zien Xie, Yukun Shu, Sidan Du

TL;DR

GPTSee is a two-stage model for moment retrieval and highlight detection in which the output of an LLM serves as the input to a second-stage transformer encoder-decoder. Even when only span anchors and similarity scores are used as outputs, its positioning accuracy exceeds that of traditional methods such as Moment-DETR.

Abstract

Moment retrieval (MR) and highlight detection (HD) aim to identify relevant moments and highlights in a video from a corresponding natural-language query. Large language models (LLMs) have demonstrated proficiency in various computer vision tasks. However, existing methods for MR&HD have not yet been integrated with LLMs. In this letter, we propose a novel two-stage model that takes the output of LLMs as the input to the second-stage transformer encoder-decoder. First, MiniGPT-4 is employed to generate detailed descriptions of the video frames and to rewrite the query statement; both are fed into the encoder as new features. Then, semantic similarity is computed between the generated descriptions and the rewritten queries. Finally, consecutive high-similarity video frames are converted into span anchors, which serve as prior position information for the decoder. Experiments demonstrate that our approach achieves state-of-the-art results and that, even when only span anchors and similarity scores are used as outputs, positioning accuracy outperforms traditional methods such as Moment-DETR.
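
As a rough illustration of the similarity step, the sketch below scores each frame description against a rewritten query using cosine similarity, the measure named in Figure 1(d). It assumes the descriptions and the query have already been embedded as fixed-length vectors; the function names and the toy 3-dimensional embeddings are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros (an illustrative convention).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def frame_similarity_scores(
    description_embeddings: Sequence[Sequence[float]],
    query_embedding: Sequence[float],
) -> List[float]:
    """One similarity score per frame description (hypothetical helper)."""
    return [cosine_similarity(d, query_embedding) for d in description_embeddings]


# Toy embeddings invented for illustration only.
query = [1.0, 0.0, 0.0]
descriptions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
scores = frame_similarity_scores(descriptions, query)
```

Here the first description points in the same direction as the query, the second is orthogonal to it, and the third lies in between, so the scores decrease from an exact match to no match accordingly.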

Paper Structure (13 sections, 5 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: (a) Describe video frame content with GPT. (b) Examples of video moment retrieval and highlight detection (MR&HD) tasks. (c) Rewrite queries using GPT. (d) Calculate the cosine similarity between the image description and the rewritten query.
  • Figure 2: An overview of our proposed model, GPTSee. Video frames and query text are first fed into MiniGPT-4, which generates image content descriptions and semantically rewritten queries. The visual extractor $E_v$ and text extractor $E_t$ then obtain features from these descriptions and rewritten queries, and the features are input into the visual and text encoders, respectively. In parallel, similarity scores are calculated from the semantic similarity between the image content descriptions of key video frames and the rewritten queries. The visual encoder receives these scores concatenated with the image description features. The encoded visual and text features interact through a cross-attention mechanism, yielding the cross-modal features $F_{vt}$. These features are processed directly by an FFN to derive the highlight scores for the HD task. Frames with consecutive high similarity scores form ranges, referred to as span anchors, which serve as prior position information for the moment decoder. For the MR task, this decoder then determines the start and end positions of video moments.
  • Figure 3: Determine the threshold and then collect the indices of all frames whose similarity scores exceed this threshold.
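
The thresholding in Figure 3 and the grouping of consecutive high-similarity frames into span anchors can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the threshold is simply passed in as a parameter, since the captions do not say how it is determined, and the `span_anchors` name, the `min_len` parameter, and the sample scores are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def span_anchors(
    scores: Sequence[float],
    threshold: float,
    min_len: int = 1,  # hypothetical knob: drop runs shorter than this
) -> List[Tuple[int, int]]:
    """Group consecutive frame indices whose score exceeds `threshold`.

    Returns inclusive (start, end) index pairs, one per run of frames.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the current run began, if any
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at frame i - 1.
            if i - start >= min_len:
                spans.append((start, i - 1))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(scores) - start >= min_len:
        spans.append((start, len(scores) - 1))
    return spans


# Per-frame similarity scores invented for illustration only.
scores = [0.1, 0.8, 0.9, 0.2, 0.7, 0.75, 0.85, 0.3]
anchors = span_anchors(scores, threshold=0.5)
print(anchors)  # [(1, 2), (4, 6)]
```

Each returned pair marks the first and last frame index of a run; converting these indices to timestamps would depend on the frame sampling rate, which is not shown here.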