MTPChat: A Multimodal Time-Aware Persona Dataset for Conversational Agents

Wanqi Yang, Yanda Li, Meng Fang, Ling Chen

TL;DR

This work addresses the lack of temporal reasoning in persona-grounded multimodal dialogue by introducing MTPChat, a dataset that injects time-aware dynamics into both conversations and grounding memories. It couples this dataset with two novel tasks, Temporal Next Response Prediction (TNRP) and Temporal Grounding Memory Prediction (TGMP), and a framework featuring an Adaptive Temporal Module (ATM) that dynamically fuses linguistic and visual streams with temporal context. Experimental results show that MTPChat presents genuine temporal reasoning challenges and that the ATM-based framework consistently outperforms baseline multimodal fusion approaches, especially when memories are present. The dataset and methods advance time-aware conversational AI, enabling models to track evolving persona memories and dialogue states over time for more coherent and contextually grounded interactions.

Abstract

Understanding temporal dynamics is critical for conversational agents, enabling effective content analysis and informed decision-making. However, time-aware datasets, particularly for persona-grounded conversations, are still limited, which narrows their scope and diminishes their complexity. To address this gap, we introduce MTPChat, a multimodal, time-aware persona dialogue dataset that integrates linguistic, visual, and temporal elements within dialogue and persona memory. Leveraging MTPChat, we propose two time-sensitive tasks: Temporal Next Response Prediction (TNRP) and Temporal Grounding Memory Prediction (TGMP), both designed to assess a model's ability to understand implicit temporal cues and dynamic interactions. Additionally, we present an innovative framework featuring an adaptive temporal module to effectively integrate multimodal streams and capture temporal dependencies. Experimental results validate the challenges posed by MTPChat and demonstrate the effectiveness of our framework in multimodal time-sensitive scenarios.

Paper Structure

This paper contains 32 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: An example of a multimodal, time-sensitive, persona-grounded scenario, showcasing how the user's dialogue responses evolve over time based on the temporal dynamics of dialogue and episodic memories.
  • Figure 2: Distribution of times across conversations and memories in the training, validation, and test sets.
  • Figure 3: Overview of the Temporal Next Response Prediction (TNRP) and Temporal Grounding Memory Prediction (TGMP) tasks. The left panel displays a user’s episodic memories, represented as image-sentence-time triplets with various creation dates. The dialogue instance on the right highlights the corresponding response and task setup.
  • Figure 4: Architecture of our framework with Adaptive Temporal Module (ATM).
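
The image-sentence-time triplets described in the Figure 3 caption can be pictured as a small record type. The sketch below is illustrative only: the field names, the `memories_before` helper, and the example data are all assumptions made for this illustration, not the MTPChat schema or the authors' method. It simply shows one plausible way a creation date could be compared against the date of a dialogue turn.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MemoryTriplet:
    """One episodic memory: an image, a sentence, and a creation date.

    Field names are hypothetical, not the dataset's actual schema.
    """
    image_path: str
    sentence: str
    created: date


def memories_before(memories, turn_date):
    """Return memories created on or before turn_date, oldest first.

    Hypothetical helper: the source does not say how memories are filtered.
    """
    visible = [m for m in memories if m.created <= turn_date]
    return sorted(visible, key=lambda m: m.created)


# Made-up example data for illustration only.
memories = [
    MemoryTriplet("img_002.jpg", "Moved to a new city.", date(2021, 6, 1)),
    MemoryTriplet("img_001.jpg", "Adopted a puppy.", date(2019, 3, 14)),
    MemoryTriplet("img_003.jpg", "Started a new job.", date(2023, 1, 9)),
]

visible = memories_before(memories, date(2022, 1, 1))
print([m.sentence for m in visible])
```

Keeping the creation date as a first-class field, rather than burying it in the sentence text, makes time comparisons between a memory and a dialogue turn explicit.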