DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

Yue Jiang, Jichu Li, Yang Liu, Dingkang Yang, Feng Zhou, Quyu Kong

TL;DR

DanmakuTPPBench addresses the lack of multi-modal benchmarks for temporal point processes by introducing DanmakuTPP-Events and DanmakuTPP-QA, derived from synchronized Danmaku comments and video frames on Bilibili. A five-agent, LLM-driven data construction pipeline generates ground truths for diverse temporal-textual-visual reasoning tasks, with 10 evaluation tasks spanning open- and closed-ended QA. Extensive experiments show clear performance gaps for both classical TPP models and current LLMs/MLLMs in multi-modal temporal reasoning, while also establishing strong baselines and highlighting the effects of model scaling and finetuning. The benchmark has the potential to drive deeper integration of temporal reasoning into multi-modal language models, advancing practical understanding of complex event dynamics in video-rich contexts.

Abstract

We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. Project page: https://github.com/FRENKIE-CHIANG/DanmakuTPPBench
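To make the notion of a multi-modal event concrete, the following is a minimal sketch of how one Danmaku event (timestamp, text, video frame) might be represented, together with the inter-event times that TPP models typically work with. The class name `DanmakuEvent`, its field names, and the helper `inter_event_times` are hypothetical and chosen for illustration only; they are not taken from the benchmark's released data format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DanmakuEvent:
    """Hypothetical container for one multi-modal event (not the official schema)."""
    timestamp: float                  # seconds from the start of the video
    text: str                         # the bullet comment shown on screen
    frame_path: Optional[str] = None  # assumed path to the matching video frame


def inter_event_times(events: List[DanmakuEvent]) -> List[float]:
    """Return the gaps between consecutive events after sorting by timestamp."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    return [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    sample = [
        DanmakuEvent(1.0, "here we go"),
        DanmakuEvent(3.5, "what a play"),
        DanmakuEvent(3.0, "nice"),
    ]
    print(inter_event_times(sample))
```

Sorting before differencing keeps the gaps non-negative even when comments arrive out of order in the raw data.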

Paper Structure

This paper contains 13 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Introduction to Danmaku TPP data. (a) Comparison between conventional video viewing and Danmaku viewing experience. Danmaku appears as overlaid text messages at specific timestamps during video playback, creating a multi-modal TPP. The example shows comments from an esports video with timestamps. (b) Danmaku event types identified in our dataset.
  • Figure 2: Statistics of DanmakuTPP-Events dataset. (a) The proportion of TPP data for video topics. (b) Distribution of video durations. (c) Distribution of Danmaku event count.
  • Figure 3: Multi-agent framework for automated construction of DanmakuTPP-QA. The framework consists of five main components: (1) DanmakuTPP-Events (top left) containing synchronized video frames, timestamps, and user comments; (2) Task-Design Agent employing a reasoning LLM to generate diverse evaluation tasks; (3) Annotation Agent Group extracting object tags, image captions, sentiment polarity, and event types; (4) Quality-Control Agent implementing consensus strategies to refine annotations through majority voting and gap filling; (5) Task-Solve Agent Group solving the designed tasks based on multi-modal inputs. This framework enables the creation of DanmakuTPP-QA covering multiple tasks with ground truths.
  • Figure 4: Evaluations of TPP models: (a) conventional TPP models on the DanmakuTPP-Events dataset; (b) LLM-based evaluation of LLMs and MLLMs on DanmakuTPP-QA open-ended TPP questions. The correctness of answers is scored from 0 to 1 by Qwen3-235B-A22B [hui2024qwen25coder].
  • Figure 5: Prompt template for Task-design Agent and its corresponding outputs.
  • ...and 4 more figures
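The Figure 3 caption mentions a Quality-Control Agent that refines annotations through majority voting and gap filling. The sketch below shows one possible reading of those two steps, under explicit assumptions: several annotators each propose a label per event, a strict majority wins, and missing consensus labels are forward-filled from the previous event. The function names `majority_vote` and `fill_gaps`, the tie-handling rule, and the forward-fill strategy are all assumptions made for illustration; the source does not specify them.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(labels: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-missing label, or None on a tie or no votes."""
    votes = Counter(label for label in labels if label is not None)
    if not votes:
        return None
    (top, top_count), *rest = votes.most_common(2)
    if rest and rest[0][1] == top_count:
        return None  # assumed policy: a tie yields no consensus
    return top


def fill_gaps(consensus: List[Optional[str]]) -> List[Optional[str]]:
    """Forward-fill missing labels from the previous event (assumed strategy)."""
    filled: List[Optional[str]] = []
    last: Optional[str] = None
    for label in consensus:
        if label is None:
            label = last
        filled.append(label)
        last = label
    return filled


if __name__ == "__main__":
    # Each inner list holds the labels proposed by three hypothetical annotators.
    proposals = [
        ["positive", "positive", "negative"],
        ["positive", "negative", None],
        ["neutral", "neutral", "neutral"],
    ]
    consensus = [majority_vote(p) for p in proposals]
    print(fill_gaps(consensus))
```

Leading gaps stay as `None` in this sketch because there is no earlier label to copy; a real pipeline might handle them differently.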