Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge
Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, Byungsoo Ko, Jonghwan Hyeon, Ho-Jin Choi
TL;DR
Stark introduces a large-scale, socially grounded long-term multi-modal conversation dataset and a multi-stage framework (Mcu) to distill long-duration dialogues from LLMs. It couples demographic-grounded persona generation, a virtual face, persona commonsense, personal narratives, event sequences, and pre-stored images with a Plan-and-Execute image aligner to produce coherent, image-sharing conversations across sessions. Trained on Stark, the Ultron 7B model demonstrates strong dialogue-to-image retrieval performance, surpassing several baselines and suggesting Stark’s utility for building persistent, persona-aware AI assistants. The work also discusses limitations around image consistency and emphasizes ethical considerations in generated content and biases, offering a practical resource for advances in multi-modal, long-range human-AI interaction.
Abstract
Humans share a wide variety of images related to their personal experiences within conversations via instant messaging tools. However, existing works (1) focus on image-sharing behavior in single sessions, which limits long-term social interaction, and (2) lack personalized image-sharing behavior. In this work, we introduce Stark, a large-scale long-term multi-modal conversation dataset that covers a wide range of social personas in a multi-modality format, time intervals, and images. To construct Stark automatically, we propose a novel multi-modal contextualization framework, Mcu, that generates long-term multi-modal dialogue distilled from ChatGPT and our proposed Plan-and-Execute image aligner. Using Stark, we train a multi-modal conversation model, Ultron 7B, which demonstrates impressive visual imagination ability. Furthermore, we demonstrate the effectiveness of our dataset in human evaluation. We make our source code and dataset publicly available.
