Localizing Events in Videos with Multimodal Queries

Gengyuan Zhang, Mang Ling Ada Fok, Jialu Ma, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu

TL;DR

This paper introduces ICQ, a new benchmark for localizing events in videos with multimodal queries (MQs), alongside an evaluation dataset, ICQ-Highlight. It also proposes 3 Multimodal Query Adaptation methods and a novel Surrogate Fine-tuning strategy, which serve as strong baseline methods.

Abstract

Localizing events in videos based on semantic queries is a pivotal task in video understanding, with the growing significance of user-oriented applications like video search. Yet, current research predominantly relies on natural language queries (NLQs), overlooking the potential of using multimodal queries (MQs) that integrate images to more flexibly represent semantic queries -- especially when it is difficult to express non-verbal or unfamiliar concepts in words. To bridge this gap, we introduce ICQ, a new benchmark designed for localizing events in videos with MQs, alongside an evaluation dataset ICQ-Highlight. To accommodate and evaluate existing video localization models for this new task, we propose 3 Multimodal Query Adaptation methods and a novel Surrogate Fine-tuning on pseudo-MQs strategy. ICQ systematically benchmarks 12 state-of-the-art backbone models, spanning from specialized video localization models to Video LLMs, across diverse application domains. Our experiments highlight the high potential of MQs in real-world applications. We believe this benchmark is a first step toward advancing MQs in video event localization.

Paper Structure (34 sections, 11 figures, 10 tables)

Figures (11)

  • Figure 1: Localizing Events in Videos with Semantic Queries. (a) So far, the community has focused only on natural language query-based video event localization, as in lei2021detecting. Our benchmark ICQ focuses on a more general scenario: localizing events in videos with multimodal queries (MQs). (b) Localizing video events with MQs has broad applications: users often use brief, ambiguous text queries like "swimming", or struggle to find precise terms for unfamiliar or abstract concepts. In such cases, MQs such as scribbles or example images can help.
  • Figure 2: Examples of ICQ-Highlight. Multimodal queries consist of a reference image and a refinement text. We consider 4 different reference image styles: scribble, cartoon, cinematic, and realistic. They describe a target event that corresponds to moments or segments in original videos and are equivalent to natural language queries in the original dataset lei2021detecting. Refinement texts add either complementary information if reference images are minimal like for scribble images, or corrective information if reference images are more complicated.
  • Figure 3: Multimodal Query Adaptation (MQA). We propose 3 MQA methods to bridge the current gap between natural language query-based models and our multimodal query-based benchmark: MQ-Cap, MQ-Sum, and VQ-Enc. We also propose MQ-Sum(+SUIT), which enhances MQ-Sum with the Surrogate Fine-tuning on pseudo-MQs (SUIT) strategy. All of them adapt MQs to conventional NLQ-based backbones.
  • Figure 4: Surrogate Fine-tuning on pseudo-MQs (SUIT) for MQ-Sum. To address the lack of training data, we propose an automatic pseudo-MQ generation pipeline that constructs a "surrogate" dataset for fine-tuning MQ-Sum.
  • Figure 5: Controlled Experiment. We plot the model performance (R1@0.7) on 2 subsets, $D_{ret}$ and $D_{gen}$. The dashed line marks equal performance on both subsets.
  • ...and 6 more figures
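
The Figure 5 caption reports R1@0.7. In moment localization this conventionally means Recall@1 at a temporal IoU threshold of 0.7: the share of queries whose top-ranked predicted segment overlaps the ground-truth segment with IoU of at least 0.7. The sketch below illustrates that conventional definition only. The function names, the (start, end) span format in seconds, and the one-ground-truth-span-per-query simplification are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed span format: (start_sec, end_sec) with start <= end.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    top1_preds: Sequence[Span],
    gts: Sequence[Span],
    iou_threshold: float = 0.7,
) -> float:
    """Fraction of queries whose top-1 span reaches the IoU threshold.

    Simplification (assumption): exactly one ground-truth span per query.
    """
    if len(top1_preds) != len(gts):
        raise ValueError("need one top-1 prediction per ground-truth span")
    if not gts:
        return 0.0
    hits = sum(
        temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts)
    )
    return hits / len(gts)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (5.0, 20.0)]
    gts = [(0.0, 12.0), (10.0, 20.0)]
    # IoUs are 10/12 and 10/15, so only the first query counts as a hit.
    print(recall_at_1(preds, gts))
```

Under this reading, a point on the dashed line in Figure 5 would correspond to a model whose R1@0.7 is identical on $D_{ret}$ and $D_{gen}$.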