
PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs

Sho Inoue, Shai Wang, Haizhou Li

TL;DR

This work addresses the absence of personality annotations in speech data by building a fully-duplex dialog pipeline that converts raw two-channel audio into richly labeled conversations with timestamps, laughter, emotion, sentiment, and response types. It leverages Whisper-based transcripts and GPT-4o to classify backchannels and predict Big Five personality traits, integrating textual, acoustic, and interactional cues. Human evaluations demonstrate that the proposed approach aligns more closely with human judgments than baselines, validating the effectiveness of emotion/sentiment, laughter, and interjection cues in personality inference. The framework enables context-sensitive, personality-aware conversational agents and points to future work on synthetic datasets and personality-conditioned dialogue systems with practical impact for user-adaptive AI systems.

Abstract

Despite significant progress in neural spoken dialog systems, personality-aware conversation agents -- capable of adapting behavior based on personalities -- remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.

Paper Structure

This paper contains 17 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Diagrams of: (a) Overall PersonaTAB Pipeline; (b) Dataset Preprocessing from Two-Channel Speech Dialog Data; (c) Personality Prediction from Speaker Attributes using Large Language Models (LLMs).
  • Figure 2: Visualization of examples of (a) Turns and Overlaps; (b) Laugh Token Integration; (c) Sentence Concatenation from Word-Level Time Stamps.
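
The preprocessing steps named in Figure 2, grouping word-level timestamps into utterances and locating overlaps between the two channels, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Segment`, `concat_words`, and `find_overlaps`, the `(token, start, end)` word format, and the 0.5-second pause threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed word format: (token, start_sec, end_sec), e.g. from an ASR system
# that emits word-level timestamps.
Word = Tuple[str, float, float]


@dataclass
class Segment:
    speaker: str
    start: float
    end: float
    text: str


def concat_words(words: List[Word], speaker: str, max_gap: float = 0.5) -> List[Segment]:
    """Merge consecutive words into one segment while the pause between
    them stays at or below max_gap seconds (a hypothetical threshold)."""
    segments: List[Segment] = []
    for token, start, end in sorted(words, key=lambda w: w[1]):
        if segments and start - segments[-1].end <= max_gap:
            last = segments[-1]
            last.end = max(last.end, end)
            last.text += " " + token
        else:
            segments.append(Segment(speaker, start, end, token))
    return segments


def find_overlaps(a: List[Segment], b: List[Segment]) -> List[Tuple[float, float]]:
    """Return the time intervals during which both speakers are talking."""
    overlaps = []
    for seg_a in a:
        for seg_b in b:
            start = max(seg_a.start, seg_b.start)
            end = min(seg_a.end, seg_b.end)
            if start < end:
                overlaps.append((start, end))
    return overlaps


if __name__ == "__main__":
    # Toy word timings for the two channels of a dialog (made-up values).
    channel_a = [("so", 0.0, 0.2), ("how", 0.25, 0.45), ("are", 0.5, 0.6),
                 ("you", 0.65, 0.9), ("great", 2.5, 2.9)]
    channel_b = [("yeah", 0.8, 1.0), ("good", 1.1, 1.4)]

    segs_a = concat_words(channel_a, "A")
    segs_b = concat_words(channel_b, "B")
    for seg in segs_a + segs_b:
        print(seg)
    print(find_overlaps(segs_a, segs_b))
```

Processing each channel separately keeps speaker attribution trivial, which is one plausible reason to start from two-channel recordings; a real pipeline would also need to handle laughter tokens and ASR timing errors, which this sketch omits.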