Investigation for Relative Voice Impression Estimation

Kenichi Fujita; Yusuke Ijima

TL;DR

This work formalizes relative voice impression estimation (RIE), which quantifies the perceptual shift between two utterances from the same speaker as a low-dimensional vector $r_{rel}$. It systematically compares three modeling paradigms: openSMILE-based classical features, self-supervised learning (SSL) speech representations (notably HuBERT), and zero-shot multimodal LLMs (MLLMs). SSL representations best capture subtle, dynamic within-speaker impression changes, while current MLLMs struggle with fine-grained pairwise judgments. The findings highlight the strength of SSL-based approaches for perceptual modeling and identify directions for improving MLLMs, including few-shot adaptation and multi-speaker extension. Practical implications include better tools for voice-acting evaluation, TTS expressivity control, and perceptual quality assessment in single-speaker datasets.

Abstract

Paralinguistic and non-linguistic aspects of speech strongly influence listener impressions. While most research focuses on absolute impression scoring, this study investigates relative voice impression estimation (RIE), a framework for predicting the perceptual difference between two utterances from the same speaker. The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark-Bright"). To isolate expressive and prosodic variation, we used recordings of a professional speaker reading a text in various styles. We compare three modeling approaches: classical acoustic features commonly used for speech emotion recognition, self-supervised speech representations, and multimodal large language models (MLLMs). Our results demonstrate that models using self-supervised representations outperform methods with classical acoustic features, particularly in capturing complex and dynamic impressions (e.g., "Cold-Warm") where classical features fail. In contrast, current MLLMs prove unreliable for this fine-grained pairwise task. This study provides the first systematic investigation of RIE and demonstrates the strength of self-supervised speech models in capturing subtle perceptual variations.
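To make the estimation target concrete, the following is a minimal sketch of how a relative impression vector could be assembled from listener judgments. The abstract only says the vector is derived from subjective evaluations; the averaging rule, the signed rating scale, the function name `relative_impression_vector`, and the toy ratings are all assumptions for illustration, not the authors' procedure.

```python
from statistics import mean

# Axis names taken from the abstract's examples; the full axis set is not
# given there, so this two-axis list is an assumption.
AXES = ["Dark-Bright", "Cold-Warm"]


def relative_impression_vector(ratings):
    """Average signed per-listener ratings into one value per axis.

    ratings: list of dicts mapping axis name -> signed score. A positive
    score means the second utterance sounds closer to the right-hand pole
    (e.g. "Bright") than the first. Scale and averaging are assumptions.
    """
    return [mean(r[axis] for r in ratings) for axis in AXES]


# Toy, made-up ratings from three hypothetical listeners for one utterance pair.
listener_ratings = [
    {"Dark-Bright": 2, "Cold-Warm": -1},
    {"Dark-Bright": 1, "Cold-Warm": 0},
    {"Dark-Bright": 3, "Cold-Warm": -2},
]

r_rel = relative_impression_vector(listener_ratings)
print(dict(zip(AXES, r_rel)))
```

Under these assumptions, each entry of `r_rel` is signed: its sign gives the direction of the shift along the antonymic axis and its magnitude gives the size of the shift.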

Paper Structure (13 sections, 1 equation, 3 figures, 6 tables)

Introduction
Dataset and impression representation
  Dataset
  Impression difference vector
Method and experimental conditions
  Classical acoustic feature-based method
  SSL-based method
  MLLM-based method
Results
  Estimation with classical acoustic feature-based methods
  Estimation with SSL-based methods
  Estimation with MLLM-based methods
Conclusion

Figures (3)

Figure 1: Overview of relative voice impression estimation. The system takes two utterances and outputs a vector of perceptual differences along predefined dimensions.
Figure 2: Overview of the SSL-based estimation.
Figure 3: Overview of the MLLM-based estimation.
