On the relationship between speech and hearing
Srinivasan Umesh, Leon Cohen, Douglas Nelson
TL;DR
This work introduces a speech-scale framework that maps physical frequency to a warped domain through a universal warping function $g(f)$, so that spectra of perceptually identical sounds from different speakers align up to a speaker-dependent translation. The authors model the relation between spectra with a piecewise-linear function and estimate the warp from real speech data; the warped spectra collapse onto a common pattern, which supports a coupling between speech production and hearing. The speech scale resembles the Mel scale: the fitted form $\boldsymbol{\eta_{\text{speech}}}=2478.24\,\log\bigl(1+\frac{f}{641.94}\bigr)$ closely matches $\boldsymbol{\eta_{\text{MEL}}}=2595\,\log_{10}\bigl(1+\frac{f}{700}\bigr)$. Cross-speaker alignment of formants (e.g., for the vowel /AW/) further supports a shared perceptual-production representation. The findings suggest a fundamental link between production-based and hearing-based frequency representations, connecting speech, hearing, and basilar-membrane place maps through closely related scales.
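The closeness of the two fitted forms quoted in this summary can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and because the summary gives no log base for the speech scale, base 10 is assumed here to match the Mel form.

```python
import math


def speech_scale(f_hz, base=10.0):
    """Fitted speech scale from the summary.

    ASSUMPTION: the summary writes "log" without a base; base 10 is
    assumed here so the form parallels the Mel formula.
    """
    return 2478.24 * math.log(1.0 + f_hz / 641.94, base)


def mel_scale(f_hz):
    """Mel-scale formula as quoted in the summary."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


if __name__ == "__main__":
    # Print both warped values side by side for a few frequencies.
    for f in (100, 500, 1000, 2000, 4000):
        s, m = speech_scale(f), mel_scale(f)
        print(f"{f:5d} Hz  speech={s:8.1f}  mel={m:8.1f}  diff={s - m:+7.1f}")
```

Under the base-10 assumption the two curves stay within a few tens of units of each other over this range, which is consistent with the "closely matching" claim. With a natural logarithm they would not, so the assumed base matters when reproducing the comparison.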
Abstract
We present a framework for experimentally linking speech production and hearing. Using this approach, we describe experimental results that lead to the concept that sounds made by different individuals and perceived to be the same can be transformed into each other by a "speech scale". The speech scale is empirically determined using only speech data. We show the similarity of the speech scale to the MEL scale of Stevens and Volkmann, which was derived only from hearing experiments. We thus experimentally link speech production and hearing.
