A Video-grounded Dialogue Dataset and Metric for Event-driven Activities

Wiradee Imrattanatrai, Masaki Asada, Kimihiro Hasegawa, Zhi-Qi Cheng, Ken Fukuda, Teruko Mitamura

TL;DR

This work introduces VDAct, a large-scale dataset for video-grounded dialogue focused on long, event-driven activities, augmented with knowledge-graph-driven scenario summaries. To address evaluation gaps, the authors propose VDEval, a session-based, LLM-driven metric that incorporates the entire dialogue history and KG-derived video summaries to better reflect human judgment. Empirical results show that VDAct is notably challenging for current vision-language models, and that VDEval achieves higher correlation with human assessments than traditional turn-based metrics. Together, VDAct and VDEval provide a more realistic benchmark and evaluation framework for multimodal dialogue systems operating in complex, temporally structured scenarios, with potential for KG-enabled architectures to improve performance.

Abstract

This paper presents VDAct, a dataset for Video-grounded Dialogue on Event-driven Activities, alongside VDEval, a session-based context evaluation metric specially designed for the task. Unlike existing datasets, VDAct includes longer and more complex video sequences that depict a variety of event-driven activities and require advanced contextual understanding for accurate response generation. The dataset comprises 3,000 dialogues with over 30,000 question-and-answer pairs, derived from 1,000 videos with diverse activity scenarios. VDAct is notably challenging due to its broad spectrum of activity scenarios and wide range of question types. Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval, which integrates dialogue session history and video content summaries extracted from our supplementary Knowledge Graphs to evaluate individual responses, demonstrates a significantly higher correlation with human assessments on the VDAct dataset than existing evaluation metrics that rely solely on the context of single dialogue turns.
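
The contrast between turn-based and session-based evaluation context described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Turn` class, the function names, the prompt layout, and the choice to fill the history with reference answers are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One dialogue turn (hypothetical structure, not from the paper)."""
    question: str
    reference: str   # ground-truth answer
    response: str    # model-generated answer to be evaluated


def turn_context(turns: List[Turn], i: int) -> str:
    """Turn-based context: only the information of turn i."""
    t = turns[i]
    return (
        f"Question: {t.question}\n"
        f"Reference: {t.reference}\n"
        f"Response: {t.response}"
    )


def session_context(turns: List[Turn], i: int, video_summary: str) -> str:
    """Session-based context: video summary plus all earlier turns,
    followed by turn i. Using reference answers for the history is an
    assumption of this sketch."""
    history = "\n".join(
        f"Q{k + 1}: {p.question}\nA{k + 1}: {p.reference}"
        for k, p in enumerate(turns[:i])
    )
    return (
        f"Video summary: {video_summary}\n"
        f"Dialogue history:\n{history}\n"
        f"{turn_context(turns, i)}"
    )
```

Either string would then be placed in the prompt of an LLM judge; the session-based variant gives the judge the material it needs to resolve references such as "then" or "it" that depend on earlier turns or on the video content.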

Paper Structure

This paper contains 48 sections, 10 figures, 12 tables.

Figures (10)

  • Figure 1: VDAct with an example dialogue (turns 1-7) for an activity scenario video, with corresponding KG elements for the events. Each input utterance (i.e., question) is labeled with the relevant question types. Note that the object IDs are omitted from the KG illustration.
  • Figure 2: Statistics on activities as scenarios in the VDAct dataset and percentages of different question types for sample dialogues in comparison with the AVSD dataset.
  • Figure 3: Comparison of turn-based and session-based contexts for evaluation metrics. The input prompt includes turn- or session-based context with the information of turn $t_i$.
  • Figure 4: The co-occurrence of activity pairs by their activity types within the scenarios. The number in parentheses after each activity type indicates the total number of activities. The number in each cell represents the percentage of an activity-type pair occurring across scenarios.
  • Figure 5: Distribution of questions in VDAct based on their first three words. The size of each arc reflects the frequency of each three-word prefix.
  • ...and 5 more figures