LLM4VG: Large Language Models Evaluation for Video Grounding

Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Houlun Chen, Zihan Song, Yuwei Zhou, Yuekui Yang, Haiyang Wu, Wenwu Zhu

TL;DR

LLM4VG is a benchmark that systematically evaluates how well different LLMs perform video grounding (VG). It also proposes prompt methods that integrate the VG instruction with descriptions from different kinds of generators, including caption-based generators for direct visual description and VQA-based generators for information enhancement.

Abstract

Researchers have recently begun investigating the capability of LLMs to handle videos and have proposed several video LLMs. However, the ability of LLMs to handle video grounding (VG), an important time-related video task that requires the model to precisely locate the start and end timestamps of the temporal moments in a video that match a given textual query, remains unclear and unexplored in the literature. To fill this gap, we propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding tasks. Based on LLM4VG, we design extensive experiments to examine two groups of models on video grounding: (i) video LLMs trained on text-video pairs (denoted as VidLLM), and (ii) LLMs combined with pretrained visual description models such as video/image captioning models. We propose prompt methods to integrate the VG instruction with descriptions from different kinds of generators, including caption-based generators for direct visual description and VQA-based generators for information enhancement. We also provide comprehensive comparisons of various VidLLMs and explore the influence of different choices of visual models, LLMs, prompt designs, and so on. Our experimental evaluations lead to two conclusions: (i) existing VidLLMs are still far from achieving satisfactory video grounding performance, and more time-related video tasks should be included to further fine-tune these models; and (ii) the combination of LLMs and visual models shows preliminary video grounding ability, with considerable room for improvement through more reliable models and further guidance from prompt instructions.

Paper Structure

This paper contains 20 sections, 1 equation, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Benchmark of LLM4VG. We analyze the influence of six visual description generators, three LLMs, and three prompting methods on video grounding, and compare them with three VidLLMs that are directly instructed to perform video grounding.
  • Figure 2: Framework of video grounding for LLMs. (a) Video grounding with VidLLMs. (b) Video grounding with LLMs and visual models. The dashed box indicates that the one-shot method inputs the exemplar prompt, description prompt, and question prompt, whereas the zero-shot method omits the exemplar prompt.
  • Figure 3: Example cases of LLMs performing the video grounding task. (a) and (b) are successful cases, while (c) and (d) are failure cases in which the LLM answers 'Based on the given caption, it is not possible to determine the grounding time for the query'. Text with a blue background is positive for the grounding answer, while text with a red background is negative for it, although it may be related to the query.
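
The prompt structure described in the Figure 2 caption (exemplar prompt, description prompt, question prompt, with the exemplar used only in the one-shot method) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `TimedDescription`, `build_description_prompt`, `build_question_prompt`, and `build_vg_prompt`, the per-second description layout, and all prompt wording are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TimedDescription:
    """One generated description tied to a timestamp in seconds (hypothetical type)."""
    time_s: float
    text: str


def build_description_prompt(descriptions: List[TimedDescription]) -> str:
    # Lay out the descriptions in temporal order, one line per timestamp.
    # The "<t>s: <text>" layout is an assumption, not the paper's format.
    ordered = sorted(descriptions, key=lambda d: d.time_s)
    lines = [f"{d.time_s:.0f}s: {d.text}" for d in ordered]
    return "Video description by second:\n" + "\n".join(lines)


def build_question_prompt(query: str) -> str:
    # The instruction wording here is illustrative only.
    return (
        f'Query: "{query}"\n'
        "Give the start and end time (in seconds) of the moment "
        "that matches the query."
    )


def build_vg_prompt(
    descriptions: List[TimedDescription],
    query: str,
    exemplar: Optional[str] = None,
) -> str:
    """Assemble the full prompt.

    With exemplar=None this is the zero-shot form (description + question).
    With an exemplar string it is the one-shot form (exemplar + description
    + question).
    """
    parts = []
    if exemplar is not None:
        parts.append(exemplar)
    parts.append(build_description_prompt(descriptions))
    parts.append(build_question_prompt(query))
    return "\n\n".join(parts)
```

Under these assumptions, calling `build_vg_prompt(descriptions, query)` yields the zero-shot prompt, and passing a worked example as `exemplar` yields the one-shot prompt. The descriptions themselves would come from a caption-based or VQA-based generator, which is outside the scope of this sketch.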