VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

Houlun Chen, Xin Wang, Hong Chen, Zeyang Zhang, Wei Feng, Bin Huang, Jia Jia, Wenwu Zhu

TL;DR

This paper proposes a more challenging fine-grained VCMR benchmark that requires methods to localize the best-matched moment in a corpus containing other partially matched candidates. The benchmark comprises Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG, which demonstrate a high level of annotation quality.

Abstract

Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding, which hinders precise video moment localization when given fine-grained queries. In this paper, we propose a more challenging fine-grained VCMR benchmark requiring methods to localize the best-matched moment from the corpus with other partially matched candidates. To improve dataset construction efficiency and guarantee high-quality data annotations, we propose VERIFIED, an automatic VidEo-text annotation pipeline that generates captions with RelIable FInE-grained statics and Dynamics. Specifically, we use large language models (LLMs) and large multimodal models (LMMs) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video. To filter out inaccurate annotations caused by LLM hallucination, we propose a Fine-Granularity Aware Noise Evaluator, in which we fine-tune a video foundation model with contrastive and matching losses augmented with disturbed hard negatives. With VERIFIED, we construct a more challenging fine-grained VCMR benchmark containing Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG, which demonstrate a high level of annotation quality. We evaluate several state-of-the-art VCMR models on the proposed datasets, revealing that there is still significant room for improvement in fine-grained video understanding in VCMR. Code and datasets are available at https://github.com/hlchen23/VERIFIED.

Paper Structure

This paper contains 12 sections, 11 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: (a) Previous VCMR, where a query may be coarse and there are many potential positive moments (green) that are not annotated, making the ground truth annotations unreasonable. (b) Our Challenging Fine-Grained VCMR, where a more fine-grained query is given and the method needs to retrieve the best matched one from partially matched candidates (pink). (c) Our VERIFIED pipeline generates fine-grained annotations with reliable static (green) and dynamic (blue) details.
  • Figure 2: Our VERIFIED annotation pipeline includes two independent modules: Statics Enhanced Captioning (A) and Dynamics Enhanced Captioning (B), which generate multiple fine-grained caption candidates with static and dynamic details. Additionally, we develop a Fine-Granularity Aware Noise Evaluator (C) that generates and selects the best disturbed positive and negative samples to fine-tune UMT using hard-negative augmented contrastive and matching losses. This evaluator grades captions, assigning low confidence scores to inaccurate ones.
  • Figure 3: Visualization of the effectiveness of our VERIFIED pipeline. (1-3) are selected from fine-grained ActivityNet-FIG, Charades-FIG, and DiDeMo-FIG, respectively. The fine-grained static and dynamic content is marked in green and blue, and inaccurate content is marked in red.
  • Figure 4: Visualization of impressive cases. (1) Our annotation captures the interaction between the dog and its handler, as well as the movement trajectory. (2) Our annotation captures details of the thrown objects and conveys that the man throws them many times. (3) Our annotation reads the textual information from the visual content and expresses the correct order in which the ingredients are used.
  • Figure 5: XML's predictions on Charades-FIG with different granularities of training data.