REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation

Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, Francesco Barbieri

TL;DR

The paper presents REALTALK, a real-world, long-term dialogue dataset for evaluating sustained open-domain conversation and emotional intelligence (EI). It analyzes differences between authentic human dialogues and LLM-simulated ones, focusing on EI attributes and persona consistency, and introduces two benchmarks, persona simulation and memory probing, to push toward more human-like, memory-aware AI. Key findings show that real interactions display diverse emotions and gradual intimacy development, whereas LLM-simulated dialogues often exhibit constrained EI and excessive empathy; fine-tuning on individual chat histories improves persona emulation, but long-term memory remains a major challenge. The work provides a benchmark and data resource for advancing personalized, memory-aware conversational systems.

Abstract

Long-term, open-domain dialogue capabilities are essential for chatbots aiming to recall past interactions and demonstrate emotional intelligence (EI). Yet, most existing research relies on synthetic, LLM-generated data, leaving open questions about real-world conversational patterns. To address this gap, we introduce REALTALK, a 21-day corpus of authentic messaging app dialogues, providing a direct benchmark against genuine human interactions. We first conduct a dataset analysis, focusing on EI attributes and persona consistency to understand the unique challenges posed by real-world dialogues. By comparing with LLM-generated conversations, we highlight key differences, including diverse emotional expressions and variations in persona stability that synthetic dialogues often fail to capture. Building on these insights, we introduce two benchmark tasks: (1) persona simulation where a model continues a conversation on behalf of a specific user given prior dialogue context; and (2) memory probing where a model answers targeted questions requiring long-term memory of past interactions. Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation. Additionally, existing models face significant challenges in recalling and leveraging long-term context within real-world conversations.

Paper Structure

This paper contains 1 section, 1 figure, 1 table.

Table of Contents

  1. Introduction

Figures (1)

  • Figure 1: A Motivation Example. LLM-simulated dialogues often exhibit excessive empathy, even when discussing negative topics, whereas real-world human dialogues demonstrate a broader emotional spectrum, incorporate reflective and grounding language, and progressively develop intimacy over time.