PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs
Sho Inoue, Shai Wang, Haizhou Li
TL;DR
This work addresses the absence of personality annotations in speech data by building a pipeline for fully-duplex dialogs that converts raw two-channel audio into richly labeled conversations with timestamps, laughter, emotion, sentiment, and response types. It uses Whisper-based transcripts and GPT-4o to classify backchannels and to predict Big Five personality traits, integrating textual, acoustic, and interactional cues. Human evaluations show that the proposed approach aligns more closely with human judgments than baselines do, supporting the value of emotion/sentiment, laughter, and interjection cues for personality inference. The framework enables context-sensitive, personality-aware conversational agents. Future work includes synthetic datasets and personality-conditioned dialogue systems for user-adaptive AI.
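The two-channel preprocessing step can be sketched as follows. This is a minimal, hypothetical Python illustration, not the authors' implementation. It assumes per-channel ASR segments are already available as (speaker, start, end, text) records, and it replaces the GPT-4o backchannel classification described above with a simple timing-and-vocabulary heuristic. The `Segment` class, `merge_channels`, `tag_backchannels`, the `max_words` threshold, and the backchannel word list are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str                # channel label, e.g. "A" or "B"
    start: float                # start time in seconds
    end: float                  # end time in seconds
    text: str                   # ASR transcript of the segment
    response_type: str = "turn"

# Hypothetical backchannel vocabulary (an assumption, not from the paper).
BACKCHANNEL_WORDS = {"yeah", "uh-huh", "mm-hmm", "right", "okay", "wow"}

def merge_channels(chan_a, chan_b):
    """Interleave per-channel ASR segments into one time-ordered dialog."""
    return sorted(chan_a + chan_b, key=lambda s: (s.start, s.end))

def tag_backchannels(dialog, max_words=2):
    """Heuristic stand-in for LLM-based classification: a short utterance made
    only of backchannel words that starts while the other speaker is still
    talking is tagged as a backchannel."""
    for i, seg in enumerate(dialog):
        words = [w.strip(".,!?") for w in seg.text.lower().split()]
        # Segments are sorted by start time, so any overlapping segment from
        # the other speaker that began earlier appears before index i.
        overlaps = any(
            other.speaker != seg.speaker and other.start < seg.start < other.end
            for other in dialog[:i]
        )
        short = 0 < len(words) <= max_words
        if overlaps and short and all(w in BACKCHANNEL_WORDS for w in words):
            seg.response_type = "backchannel"
    return dialog

a = [Segment("A", 0.0, 4.0, "So I went to the market yesterday and"),
     Segment("A", 4.5, 7.0, "bought some apples.")]
b = [Segment("B", 2.0, 2.4, "Mm-hmm."),
     Segment("B", 7.5, 9.0, "What kind of apples did you get?")]
dialog = tag_backchannels(merge_channels(a, b))
print([s.response_type for s in dialog])  # ['turn', 'backchannel', 'turn', 'turn']
```

A timing heuristic like this only catches overlapping, lexically obvious backchannels; non-overlapping acknowledgements or interjections would need a richer classifier such as the LLM-based one the work describes.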
Abstract
Despite significant progress in neural spoken dialog systems, personality-aware conversational agents, which adapt their behavior to personality, remain underexplored because speech datasets lack personality annotations. We propose a pipeline that preprocesses raw audio recordings into a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We use an automatic speech recognition (ASR) system to extract transcripts and timestamps, and then generate conversation-level annotations. Building on these annotations, we design a system that uses large language models (LLMs) to predict personality from conversations. Human evaluators identified conversational characteristics and assigned personality labels. Our analysis shows that the proposed system aligns more closely with these human judgments than existing approaches do.
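As a hedged illustration of how annotated turns might be passed to an LLM for Big Five prediction, the sketch below serializes turns into a text prompt and validates a JSON reply. The dict keys, the 1 to 5 rating scale, the prompt wording, and the helpers `format_dialog`, `build_prompt`, and `parse_scores` are hypothetical; no model is called, and the reply string only stands in for an actual LLM response.

```python
import json

BIG_FIVE = ("openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism")

def format_dialog(turns):
    """Render annotated turns (dicts with hypothetical keys) as prompt lines."""
    lines = []
    for t in turns:
        tags = (f"{t['response_type']}, emotion={t['emotion']}, "
                f"sentiment={t['sentiment']}")
        if t.get("laughter"):
            tags += ", laughter"
        lines.append(f"[{t['start']:.2f}-{t['end']:.2f}] "
                     f"{t['speaker']} ({tags}): {t['text']}")
    return "\n".join(lines)

def build_prompt(turns, target_speaker):
    """Ask for 1-5 ratings of one speaker's Big Five traits as JSON."""
    return (f"Rate speaker {target_speaker}'s Big Five personality traits "
            f"from 1 (low) to 5 (high). Reply only with a JSON object whose "
            f"keys are: {', '.join(BIG_FIVE)}.\n\n" + format_dialog(turns))

def parse_scores(reply):
    """Parse the model's JSON reply and check that every trait is in range."""
    raw = json.loads(reply)
    scores = {}
    for trait in BIG_FIVE:
        if trait not in raw:
            raise ValueError(f"missing trait: {trait}")
        value = float(raw[trait])
        if not 1.0 <= value <= 5.0:
            raise ValueError(f"{trait} out of range: {value}")
        scores[trait] = value
    return scores

turns = [
    {"speaker": "A", "start": 0.0, "end": 4.0,
     "text": "So I went to the market yesterday and",
     "response_type": "turn", "emotion": "happy",
     "sentiment": "positive", "laughter": True},
    {"speaker": "B", "start": 2.0, "end": 2.4, "text": "Mm-hmm.",
     "response_type": "backchannel", "emotion": "neutral",
     "sentiment": "neutral", "laughter": False},
]
prompt = build_prompt(turns, "A")
# Stand-in for an LLM reply; no model is called here.
reply = ('{"openness": 4, "conscientiousness": 3, "extraversion": 5, '
         '"agreeableness": 4, "neuroticism": 2}')
scores = parse_scores(reply)
print(scores["extraversion"])  # 5.0
```

Validating the reply against a fixed trait list and range keeps malformed model outputs from silently entering downstream comparisons with human labels.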
