
Investigation for Relative Voice Impression Estimation

Kenichi Fujita, Yusuke Ijima

TL;DR

This work formalizes relative voice impression estimation (RIE), which quantifies the perceptual shift between two utterances from the same speaker as a low-dimensional vector $r_{rel}$. It systematically compares three modeling paradigms: classical openSMILE-based features, self-supervised learning (SSL) speech representations (notably HuBERT), and zero-shot multimodal LLMs (MLLMs). SSL representations best capture subtle, dynamic within-speaker impression changes, while current MLLMs struggle with fine-grained pairwise judgments. The findings highlight the strength of SSL-based approaches for perceptual modeling and point to directions for improving MLLMs, including few-shot adaptation and multi-speaker extension. Practical implications include better tools for voice-acting evaluation, TTS expressivity control, and perceptual quality assessment in single-speaker datasets.

Abstract

Paralinguistic and non-linguistic aspects of speech strongly influence listener impressions. While most research focuses on absolute impression scoring, this study investigates relative voice impression estimation (RIE), a framework for predicting the perceptual difference between two utterances from the same speaker. The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark–Bright"). To isolate expressive and prosodic variation, we used recordings of a professional speaker reading a text in various styles. We compare three modeling approaches: classical acoustic features commonly used for speech emotion recognition, self-supervised speech representations, and multimodal large language models (MLLMs). Our results demonstrate that models using self-supervised representations outperform methods with classical acoustic features, particularly in capturing complex and dynamic impressions (e.g., "Cold–Warm") where classical features fail. In contrast, current MLLMs prove unreliable for this fine-grained pairwise task. This study provides the first systematic investigation of RIE and demonstrates the strength of self-supervised speech models in capturing subtle perceptual variations.
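
To make the estimation target concrete, the following is a minimal sketch of how a relative impression vector could be assembled from listener judgments of one utterance pair. The signed rating scale, the use of a plain mean across listeners, the example scores, and the function name `relative_impression_vector` are all assumptions made for illustration; the abstract states only that the vector is derived from subjective evaluations along antonymic axes.

```python
from statistics import mean

# Only these two axis names appear in the abstract; the full axis set
# used in the paper is not listed here.
AXES = ["Dark-Bright", "Cold-Warm"]


def relative_impression_vector(ratings):
    """Average per-listener ratings into one value per axis.

    ratings: list of dicts mapping axis name -> signed score. A positive
    score means the second utterance sounds closer to the right-hand
    pole (e.g. "Bright") than the first. The signed scale and the plain
    mean are assumptions of this sketch, not the paper's procedure.
    """
    return [mean(r[axis] for r in ratings) for axis in AXES]


# Hypothetical ratings from three listeners for a single utterance pair.
listener_ratings = [
    {"Dark-Bright": 2, "Cold-Warm": -1},
    {"Dark-Bright": 1, "Cold-Warm": 0},
    {"Dark-Bright": 3, "Cold-Warm": -2},
]

r_rel = relative_impression_vector(listener_ratings)
print(dict(zip(AXES, r_rel)))
```

Under these assumptions, a positive entry reads as "the second utterance moved toward the right-hand pole" and a negative entry as a move toward the left-hand pole.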

Paper Structure (13 sections, 1 equation, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Overview of relative voice impression estimation. The system takes two utterances and outputs a vector of perceptual differences along predefined dimensions.
  • Figure 2: Overview of the SSL-based estimation.
  • Figure 3: Overview of the MLLM-based estimation.
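
The interface described in the Figure 1 caption (two utterances in, a vector of perceptual differences out) can be sketched as below. This is an illustration only: the fixed-length utterance embeddings, the linear map over their difference, and the names `predict_relative` and `W` are hypothetical and are not taken from the paper, whose actual SSL-based and MLLM-based architectures are shown in Figures 2 and 3.

```python
def predict_relative(emb_first, emb_second, weights):
    """Map a pair of utterance embeddings to per-axis differences.

    Assumption of this sketch: the prediction is a linear map applied to
    (emb_second - emb_first), so swapping the pair flips every sign.
    """
    diff = [b - a for a, b in zip(emb_first, emb_second)]
    return [sum(w * d for w, d in zip(row, diff)) for row in weights]


# Toy 3-dimensional embeddings and a 2-axis weight matrix (made-up values).
emb_a = [0.5, -1.0, 2.0]
emb_b = [1.5, 0.0, 1.0]
W = [
    [1.0, 0.0, 0.0],   # hypothetical weights for one axis
    [0.0, 1.0, -1.0],  # hypothetical weights for a second axis
]

forward = predict_relative(emb_a, emb_b, W)   # second relative to first
backward = predict_relative(emb_b, emb_a, W)  # pair order swapped
print(forward, backward)
```

The difference-based form is chosen here only because it makes the "second relative to first" convention explicit; a real model could just as well consume both embeddings jointly.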