IVCR-200K: A Large-Scale Multi-turn Dialogue Benchmark for Interactive Video Corpus Retrieval

Ning Han, Yawen Zeng, Shaohua Long, Chengqing Li, Sijie Yang, Dun Tan, Jianfeng Dong, Jingjing Chen

TL;DR

The paper tackles the gap in video retrieval by introducing an interactive, multi-turn paradigm for video corpus and moment retrieval. It delivers IVCR-200K, a bilingual, high-quality dataset, and a framework called InterLLaVA that combines fast video retrieval with multimodal language modeling to support explainable, dialog-driven interactions. Through extensive experiments, it shows that multi-turn dialogue enhances retrieval and localization performance and analyzes robustness to data and module variations. The work paves the way for personalized, conversational video search that can adapt to user intent and provide transparent reasoning behind results.

Abstract

In recent years, significant developments have been made in both video retrieval and video moment retrieval tasks, which respectively retrieve complete videos or moments for a given text query. These advancements have greatly improved user satisfaction during the search process. However, previous work has failed to establish meaningful "interaction" between the retrieval system and the user, and its one-way retrieval paradigm can no longer fully meet the personalization and dynamic needs of at least 80.8% of users. In this paper, we introduce the Interactive Video Corpus Retrieval (IVCR) task, a more realistic setting that enables multi-turn, conversational, and realistic interactions between the user and the retrieval system. To facilitate research on this challenging task, we introduce IVCR-200K, a high-quality, bilingual, multi-turn, conversational, and abstract semantic dataset that supports video retrieval and even moment retrieval. Furthermore, we propose a comprehensive framework based on multi-modal large language models (MLLMs) to help users interact in several modes with more explainable solutions. The extensive experiments demonstrate the effectiveness of our dataset and framework.

Paper Structure

This paper contains 15 sections, 4 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Visualization of the video retrieval, moment retrieval, and our multi-turn interactive retrieval.
  • Figure 2: Investigation of User Search Behavior, Feedback, and Interaction Turns in ShareGPT. Users demonstrate a pronounced inclination towards interactive search and harbor high expectations regarding interaction rounds.
  • Figure 3: The pipeline of the dataset collection.
  • Figure 4: Distribution of question lengths, answer lengths, and dialogue lengths.
  • Figure 5: Distribution of turn lengths, video lengths, and moment lengths.
  • ...and 3 more figures