Beyond Discrete Personas: Personality Modeling Through Journal Intensive Conversations

Sayantan Pal, Souvik Das, Rohini K. Srihari

TL;DR

This work tackles the problem of static personas in personalized dialogue by introducing Journal Intensive Conversations (JIC), a large-scale dataset built from Reddit journal entries and enhanced with Big Five personality alignment. It combines journal clustering and convergence filtering with retrieval-augmented fine-tuning and LoRA-based parameter efficiency to generate and fine-tune models (notably LLaMA 3 8B) that produce more personality-driven, coherent conversations. The paper reports an 11% improvement in capturing personality traits, validated through automatic metrics and the LM Eval Harness, and demonstrates the benefits and limitations of data scaling, retrieval, and model baselines. It also emphasizes safety and ethics, including toxicity analysis and principled handling of biased trait representations, aiming to enable more authentic human-AI interactions in real-world settings.

Abstract

Large Language Models (LLMs) have significantly improved personalized conversational capabilities. However, existing datasets like Persona Chat, Synthetic Persona Chat, and Blended Skill Talk rely on static, predefined personas. This approach often results in dialogues that fail to capture the fluid and evolving nature of human personalities. To overcome these limitations, we introduce a novel dataset with around 400,000 dialogues and a framework for generating personalized conversations using long-form journal entries from Reddit. Our approach clusters journal entries for each author and filters them by selecting the most representative cluster, ensuring that the retained entries best reflect the author's personality. We further refine the data by capturing the Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism), ensuring that dialogues authentically reflect an individual's personality. Using Llama 3 70B, we generate high-quality, personality-rich dialogues grounded in these journal entries. Fine-tuning models on this dataset leads to an 11% improvement in capturing personality traits on average, outperforming existing approaches in generating more coherent and personality-driven dialogues.
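
The per-author cluster-and-filter step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which representation, clustering algorithm, or selection rule is used. Here we assume, purely for illustration, token-set Jaccard similarity in place of learned embeddings, a greedy "leader" clustering with a hypothetical `threshold` parameter, and that the largest cluster counts as the "most representative" one. All function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercased word set for one journal entry (toy stand-in for an embedding)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_entries(entries, threshold=0.2):
    """Greedy leader clustering over one author's entries.

    Each entry joins the first cluster whose leader (its first member) is
    at least `threshold` similar; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of entry indices.
    """
    token_sets = [tokens(e) for e in entries]
    clusters = []
    for i, ts in enumerate(token_sets):
        for members in clusters:
            if jaccard(ts, token_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def most_representative(entries, threshold=0.2):
    """Keep only the entries in the largest cluster (assumed selection rule)."""
    clusters = cluster_entries(entries, threshold)
    best = max(clusters, key=len)  # ties resolve to the earliest cluster
    return [entries[i] for i in best]


# Toy journal entries for a single hypothetical author.
entries = [
    "I went hiking alone today and felt calm",
    "Another hiking day alone, I felt calm and happy",
    "My boss yelled at me about the report",
    "Hiking alone again today, calm as ever",
]
kept = most_representative(entries)
```

On this toy input, the three hiking entries form one cluster and the off-topic work entry is filtered out. In a real pipeline the similarity function would more plausibly operate on sentence embeddings, and the selection rule might weigh cluster cohesion as well as size.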

Paper Structure

This paper contains 31 sections, 4 equations, 8 figures, 20 tables, 2 algorithms.

Figures (8)

  • Figure 1: LLMs fine-tuned on our JIC dataset align best with the golden annotations in capturing personality traits compared to other models (image shows Ft. on Persona Chat). Detailed dialogue is shown in Fig. \ref{fig:main}.
  • Figure 2: The synthetic data generation process is outlined in five distinct stages (left side). On the right side, we demonstrate how dialogues are generated from journal entries, highlighting the personality traits they reflect and align with. In Stage 3, where personality trait filtering is introduced, the initial values of the $\alpha$ and $\beta$ parameters were set to None to allow extensive data generation before further refinement.
  • Figure 3: Model Training and Inference Settings
  • Figure 4: Performance of LLaMA (left) and Mistral (right) models across various JIC dataset splits. Reported: BLEU, METEOR, ROUGE-L, Avg (across all; Table \ref{tab:basic_JIC}).
  • Figure 5: Performance of LLaMA and Mistral models across various JIC dataset splits. The left panel displays the results for LLaMA, while the right panel shows the results for Mistral.
  • ...and 3 more figures