EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model

Guozhang Li, Xinpeng Ding, De Cheng, Jie Li, Nannan Wang, Xinbo Gao

TL;DR

This paper proposes EtC (Expand then Clarify), which first uses additional information to expand the initial incomplete pseudo-temporal boundaries and then refines the expanded boundaries to obtain precise ones.

Abstract

Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection because temporal boundary annotations are absent. To bridge the gap between video-level and boundary-level annotation, explicit-supervision methods, i.e., those that generate pseudo-temporal boundaries for training, have achieved great success. However, the data augmentations in these methods might disrupt critical temporal information, yielding poor pseudo boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose EtC (Expand then Clarify), which first uses the additional information to expand the initial incomplete pseudo boundaries and then refines the expanded boundaries to obtain precise ones. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within the initial pseudo boundaries, yielding more comprehensive descriptions for the expanded boundaries. To further clarify the noisy expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective, learning to balance the incomplete yet clean (initial) boundaries against the comprehensive yet noisy (expanded) ones and so obtain more precise boundaries. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.

Paper Structure (18 sections, 7 equations, 5 figures, 17 tables)

Figures (5)

  • Figure 1: (a) The original implicit-supervision methods. (b) The original explicit-supervision methods with simple pseudo labels. (c) The proposed method, which builds the proposal from descriptions generated by the MLLM so that the proposal is as complete as possible and covers all content.
  • Figure 2: (a) The temporal pseudo-boundary expansion module. We use an MLLM to describe the frames within the initial pseudo boundary produced by the basic WSVG model, and these descriptions guide the basic model to generate a more comprehensive (expanded) boundary. We then clarify the expanded pseudo boundaries with (b) the PCL loss, which balances the initial pseudo boundary against the expanded boundary, and (c) mutual learning, which jointly considers the initial pseudo boundaries (incomplete yet clean) and the expanded pseudo boundaries (comprehensive yet noisy).
  • Figure 3: An example of the maximum and minimum normalized Query-Description Matching (QDM) and Query-Frame Matching (QFM) scores within a video from the training dataset.
  • Figure 4: Qualitative results of the ground truth (GT), the Baseline model, and the Baseline model with our EtC framework. The first examples are from the Charades-STA dataset, and the last two examples are from the ActivityNet Captions dataset.
  • Figure 5: Sample counts at different ratios of the intersection length between the pseudo boundaries and GT to the length of GT. The horizontal axis shows this ratio, and the vertical axis shows the number of test samples. The baseline model is CPL.
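
The ratio plotted in Figure 5 (intersection length between a pseudo boundary and GT, divided by the length of GT) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the (start, end) representation in seconds, and the choice of five equal-width bins are all assumptions made here for illustration.

```python
def gt_coverage_ratio(pseudo, gt):
    """Fraction of the ground-truth segment covered by a pseudo boundary.

    Both arguments are hypothetical (start, end) pairs in seconds.
    """
    p_start, p_end = pseudo
    g_start, g_end = gt
    gt_len = g_end - g_start
    if gt_len <= 0:
        raise ValueError("ground-truth segment must have positive length")
    # Overlap is zero when the two segments are disjoint.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    return inter / gt_len


def count_by_ratio(ratios, num_bins=5):
    """Count samples per equal-width ratio bin over [0, 1].

    num_bins is an assumed value; a ratio of exactly 1.0 falls in the last bin.
    """
    counts = [0] * num_bins
    for r in ratios:
        idx = min(int(r * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts


if __name__ == "__main__":
    gt = (4.0, 12.0)
    pseudos = [(2.0, 6.0), (0.0, 20.0), (13.0, 15.0)]
    ratios = [gt_coverage_ratio(p, gt) for p in pseudos]
    print(ratios)
    print(count_by_ratio(ratios))
```

Under this definition, a ratio near 1 means the pseudo boundary covers almost all of the ground-truth segment. Note that the ratio alone does not penalize a pseudo boundary that extends beyond GT.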