Investigation for Relative Voice Impression Estimation
Kenichi Fujita, Yusuke Ijima
TL;DR
This work formalizes relative voice impression estimation (RIE), which quantifies the perceptual shift between two utterances from the same speaker as a low-dimensional vector $r_{rel}$. It systematically compares three modeling paradigms: classical openSMILE-based features, self-supervised learning (SSL) speech representations (notably HuBERT), and zero-shot multimodal LLMs (MLLMs). SSL representations best capture subtle, dynamic within-speaker impression changes, while current MLLMs struggle with fine-grained pairwise judgments. The findings highlight the strength of SSL-based approaches for perceptual modeling and identify directions for improving MLLMs, including few-shot adaptation and multi-speaker extension. Practical implications include better tools for voice-acting evaluation, TTS expressivity control, and perceptual quality assessment in single-speaker datasets.
Abstract
Paralinguistic and non-linguistic aspects of speech strongly influence listener impressions. While most research focuses on absolute impression scoring, this study investigates relative voice impression estimation (RIE), a framework for predicting the perceptual difference between two utterances from the same speaker. The estimation target is a low-dimensional vector derived from subjective evaluations; each dimension quantifies the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark-Bright"). To isolate expressive and prosodic variation, we use recordings of a professional speaker reading a text in various styles. We compare three modeling approaches: classical acoustic features commonly used for speech emotion recognition, self-supervised speech representations, and multimodal large language models (MLLMs). Our results show that models using self-supervised representations outperform methods based on classical acoustic features, particularly in capturing complex and dynamic impressions (e.g., "Cold-Warm") where classical features fail. In contrast, current MLLMs prove unreliable for this fine-grained pairwise task. This study provides the first systematic investigation of RIE and demonstrates the strength of self-supervised speech models in capturing subtle perceptual variations.
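To make the task formulation concrete, the following is a minimal sketch of relative impression estimation as a pairwise regression problem. It is not the model used in this study: the abstract does not specify an architecture, so the difference-of-embeddings feature, the ridge regressor, the embedding dimension, and the synthetic data are all assumptions made for illustration. Only the two axis names are taken from the text, and the random embeddings stand in for pooled utterance-level features (for example, from a self-supervised speech model).

```python
import numpy as np

# Axis names taken from the abstract; the full set of axes is not given there.
AXES = ["Dark-Bright", "Cold-Warm"]


def pair_features(emb_first, emb_second):
    """Assumed pair feature: embedding of the second utterance minus the first."""
    return np.asarray(emb_second) - np.asarray(emb_first)


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression without bias: W = (X^T X + alpha I)^-1 X^T Y."""
    dim = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(dim), X.T @ Y)


def predict_relative(W, emb_first, emb_second):
    """Predict the relative impression vector for an ordered utterance pair."""
    return pair_features(emb_first, emb_second) @ W


# Synthetic stand-in data (hypothetical): random "embeddings" and a random
# linear mapping that generates noisy relative impression scores.
rng = np.random.default_rng(0)
n_pairs, dim = 200, 16
emb_a = rng.normal(size=(n_pairs, dim))   # first utterance of each pair
emb_b = rng.normal(size=(n_pairs, dim))   # second utterance of each pair
true_W = rng.normal(size=(dim, len(AXES)))
r_rel = (emb_b - emb_a) @ true_W + 0.01 * rng.normal(size=(n_pairs, len(AXES)))

W = fit_ridge(pair_features(emb_a, emb_b), r_rel, alpha=1e-3)

pred_ab = predict_relative(W, emb_a[0], emb_b[0])  # shift of B relative to A
pred_ba = predict_relative(W, emb_b[0], emb_a[0])  # shift of A relative to B
pred_aa = predict_relative(W, emb_a[0], emb_a[0])  # identical utterances
```

Because the sketch uses a bias-free linear map on the embedding difference, swapping the two utterances flips the sign of the prediction and an identical pair yields a zero vector. That antisymmetry is a property of this illustrative design, not a claim about the models compared in the study.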
