Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, Byungsoo Ko, Jonghwan Hyeon, Ho-Jin Choi

TL;DR

Stark introduces a large-scale, socially grounded long-term multi-modal conversation dataset and a multi-stage framework (Mcu) to distill long-duration dialogues from LLMs. It couples demographic-grounded persona generation, a virtual face, persona commonsense, personal narratives, event sequences, and pre-stored images with a Plan-and-Execute image aligner to produce coherent, image-sharing conversations across sessions. Trained on Stark, the Ultron 7B model demonstrates strong dialogue-to-image retrieval performance, surpassing several baselines and suggesting Stark’s utility for building persistent, persona-aware AI assistants. The work also discusses limitations around image consistency and emphasizes ethical considerations in generated content and biases, offering a practical resource for advances in multi-modal, long-range human-AI interaction.

Abstract

Humans share a wide variety of images related to their personal experiences within conversations via instant messaging tools. However, existing works (1) focus on image-sharing behavior in single sessions, which limits long-term social interaction, and (2) lack personalized image-sharing behavior. In this work, we introduce Stark, a large-scale long-term multi-modal conversation dataset that covers a wide range of social personas in a multi-modality format, time intervals, and images. To construct Stark automatically, we propose a novel multi-modal contextualization framework, Mcu, that generates long-term multi-modal dialogue distilled from ChatGPT and our proposed Plan-and-Execute image aligner. Using Stark, we train a multi-modal conversation model, Ultron 7B, which demonstrates impressive visual imagination ability. Furthermore, we demonstrate the effectiveness of our dataset in human evaluation. We make our source code and dataset publicly available.
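
To make the shape of one Stark example more concrete, the components named in the TL;DR and abstract (demographics, virtual face, persona, persona commonsense, personal narrative, pre-stored images, and dated sessions with image sharing) can be sketched as a small Python record. This is only an illustrative sketch: the class names, field names, and sample values below are assumptions made for this example and are not the schema of the released dataset.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


# All names below are hypothetical; the released Stark files may be organized differently.
@dataclass
class Turn:
    speaker: str
    text: str = ""
    shared_image: Optional[str] = None  # image description, set only when the turn shares an image


@dataclass
class Session:
    session_date: date
    event: str  # the event this session is grounded in
    turns: List[Turn] = field(default_factory=list)


@dataclass
class Episode:
    age: int
    gender: str
    birthplace: str
    residence: str
    appearance: str           # textual stand-in for the virtual face
    persona: str
    persona_commonsense: str
    narrative: str
    device_images: List[str] = field(default_factory=list)  # pre-stored images
    sessions: List[Session] = field(default_factory=list)

    def time_intervals_in_days(self) -> List[int]:
        """Days elapsed between consecutive sessions, in chronological order."""
        ordered = sorted(self.sessions, key=lambda s: s.session_date)
        return [(later.session_date - earlier.session_date).days
                for earlier, later in zip(ordered, ordered[1:])]

    def shared_images(self) -> List[str]:
        """Descriptions of every image shared, in stored session and turn order."""
        return [t.shared_image
                for s in self.sessions
                for t in s.turns
                if t.shared_image is not None]


# Made-up sample values, for illustration only.
episode = Episode(
    age=29, gender="female", birthplace="South Korea", residence="Canada",
    appearance="short black hair, round glasses",
    persona="I enjoy weekend hiking.",
    persona_commonsense="wants to stay healthy",
    narrative="I started hiking to clear my head after work.",
    device_images=["a photo of hiking boots"],
    sessions=[
        Session(date(2023, 3, 1), "plans a hike",
                [Turn("user", "Thinking about a trail this weekend.")]),
        Session(date(2023, 3, 15), "completes the hike",
                [Turn("user", "Made it to the top!", "a photo of a mountain summit")]),
    ],
)

print(episode.time_intervals_in_days())
print(episode.shared_images())
```

Keeping the time gaps derived from session dates, rather than stored separately, is one possible design choice; it avoids the two going out of sync when sessions are added or reordered.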

Paper Structure (57 sections, 11 figures, 6 tables)

Figures (11)

  • Figure 1: An overview of Mcu and an example of Stark. At the top, our framework takes basic demographic information (i.e., age, gender, birthplace, residence) and generates a long-term multi-modal conversation. At the bottom, Stark includes various information such as the user's appearance, social persona, persona commonsense, personal narrative, a collection of pre-stored device images, temporal event sequences, and multi-modal dialogue. In this figure, a short sentence between two events indicates the user's episodic experience between those events (e.g., "feeling rejuvenated").
  • Figure 2: An illustration of our Plan-and-Execute image aligner process.
  • Figure 3: The ratio (%) of Top-10 device image categories in Stark.
  • Figure 4: The distribution of year and time interval in Stark.
  • Figure 5: Results of head-to-head comparison between Stark (ours) and two existing datasets, DialogCC (Lee et al., 2024) and MMDialog (Feng et al., 2022), on six evaluation criteria.
  • ...and 6 more figures