PicPersona-TOD: A Dataset for Personalizing Utterance Style in Task-Oriented Dialogue with Image Persona

Jihyun Lee, Yejin Jeon, Seungyeon Seo, Gary Geunbae Lee

TL;DR

PicPersona-TOD advances task-oriented dialogue (TOD) by introducing an image-based persona to personalize system responses. Its automated five-stage data-generation pipeline aligns user images with dialogues, transfers utterance style, and uses first-impression prompts plus retrieval-augmented knowledge from Google Maps and Wikipedia to reduce hallucinations; the generated data is then filtered, and ethical considerations are addressed. The authors also present Pictor, a vision-language NLG baseline that personalizes responses and generalizes to unseen domains while retaining core TOD capabilities such as dialogue state tracking (DST) and policy inference. Human evaluations indicate improved user experience and personalization quality, suggesting that multimodal personas make TOD interactions more engaging and offering a foundation for future research on multimodal personalization in dialogue systems.

Abstract

Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonous responses that lack individuality and fail to adapt to users' personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses tailored to user-specific factors such as age or emotional context. This is facilitated by first impressions, dialogue policy-guided prompting, and the use of external knowledge to reduce hallucinations. Human evaluations confirm that our dataset enhances user experience, with personalized responses contributing to a more engaging interaction. Additionally, we introduce a new NLG model, Pictor, which not only personalizes responses but also demonstrates robust performance across unseen domains. Project repository: https://github.com/JihyunLee1/PicPersona

Paper Structure

This paper contains 36 sections, 18 figures, and 6 tables.

Figures (18)

  • Figure 1: Example of PicPersona-TOD: Unlike existing TOD datasets (in grey), which lack user personas and personalization, PicPersona-TOD uses user images to generate tailored responses.
  • Figure 2: An overview of the automatic pipeline for generating PicPersona-TOD dataset.
  • Figure 3: Visualizations of personalization strength and personalization direction filtering processes are shown on the left and right, respectively.
  • Figure 4: Examples of filtered-out results: style strength filtering (left) and style direction filtering (right).
  • Figure 5: Lexical analysis of PicPersona-TOD. "Word level" refers to the years of education needed to understand the text, while the politeness score represents the average use of politeness strategies.
  • ...and 13 more figures