On the relationship between speech and hearing

Srinivasan Umesh, Leon Cohen, Douglas Nelson

TL;DR

This work introduces a speech-scale framework that maps physical frequency to a warped domain via a universal warping function $g(f)$, such that spectra of perceptually identical sounds from different speakers align up to a speaker-dependent translation. By modeling the relation between spectra with a piecewise-linear function and estimating the warp from real speech data, the authors demonstrate that the warped spectra collapse onto a common pattern, validating the coupling between speech production and hearing. The speech-scale is shown to resemble the Mel scale, with a fitted form $2478.24\log\bigl(1+\frac{f}{641.94}\bigr)$ closely matching the Mel formula $2595\log_{10}\bigl(1+\frac{f}{700}\bigr)$, and cross-speaker alignment of formants (e.g., vowel /AW/) further supports a shared perceptual-production representation. The findings suggest a fundamental link between production-based and hearing-based frequency representations, connecting speech, hearing, and basilar-membrane place maps through closely related scales.
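To make the comparison of the two formulas concrete, the following minimal sketch evaluates both at a few frequencies. The function names are hypothetical, and the base of the logarithm in the fitted speech-scale formula is not stated above; base 10 is assumed here because it matches the Mel formula's convention.

```python
import math


def mel_scale(f_hz: float) -> float:
    """Mel formula quoted above: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def speech_scale(f_hz: float) -> float:
    """Fitted speech-scale formula quoted above.

    ASSUMPTION: the logarithm is taken to base 10; the text does not
    state the base for this formula.
    """
    return 2478.24 * math.log10(1.0 + f_hz / 641.94)


if __name__ == "__main__":
    # Print both scales side by side for a few frequencies in the speech range.
    for f in (250.0, 500.0, 1000.0, 2000.0, 4000.0):
        m = mel_scale(f)
        s = speech_scale(f)
        print(f"f = {f:6.0f} Hz   mel = {m:8.1f}   speech = {s:8.1f}   "
              f"relative difference = {abs(s - m) / m:.3%}")
```

Under the base-10 assumption, the two curves stay within a few percent of each other over this range, which is the sense in which the fitted form "closely matches" the Mel formula.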

Abstract

We present a framework for experimentally linking speech production and hearing. Using this approach, we describe experimental results that lead to the concept that sounds made by different individuals and perceived to be the same can be transformed into each other by a "speech scale". The speech scale is empirically determined using only speech data. We show the similarity of the speech scale to the MEL scale of Stevens and Volkmann, which was derived only from hearing experiments. We thus experimentally link speech production and hearing.

Paper Structure (5 sections, 10 equations, 5 figures)


Figures (5)

  • Figure 1: Formants of four different hypothetical speakers. The horizontal axis is real frequency $f$, measured in hertz.
  • Figure 2: Each of the spectra in Fig. 1 is transformed according to the function $\nu=0.9\log(f)+0.6(\log(f))^{2}$ and replotted. The horizontal axis is now $\nu$. In the new domain $\nu$, the spectra are identical except for a translation factor. See the next figure.
  • Figure 3: This plot confirms that, after the formants are aligned, the spectra are indeed identical.
  • Figure 4: The Speech-Scale, the Stevens and Volkmann data, and Békésy's data. The Speech-Scale was obtained empirically from actual speech data. The Stevens and Volkmann data were obtained from a psycho-physiological study, while the Békésy data were obtained from experiments on the basilar membrane. The similarity of all three curves shows a strong connection between speech and hearing.
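The idea behind Figures 1 to 3 can be sketched numerically: if speakers differ by a pure translation in the warped domain $\nu$, then warping their formants makes them line up after a shift, even though the formants are not related by a single scale factor in hertz. The sketch below is illustrative only. The reference formants and shift values are made-up numbers, the helper names are hypothetical, and the natural logarithm is assumed because the caption does not state the base.

```python
import math


def warp(f_hz: float) -> float:
    """nu = 0.9*log(f) + 0.6*(log(f))**2, as in the Figure 2 caption.

    ASSUMPTION: natural logarithm; the caption does not state the base.
    """
    x = math.log(f_hz)
    return 0.9 * x + 0.6 * x * x


def unwarp(nu: float) -> float:
    """Inverse of warp for log(f) > -0.75, via the quadratic formula.

    Solves 0.6*x**2 + 0.9*x - nu = 0 for the larger root x = log(f).
    """
    x = (-0.9 + math.sqrt(0.81 + 2.4 * nu)) / 1.2
    return math.exp(x)


# Hypothetical reference formants (Hz) and per-speaker translations in nu.
reference_formants = [500.0, 1500.0, 2500.0]
speaker_shifts = [0.0, 0.5, 1.0, 1.5]

# Build each hypothetical speaker by translating the reference in the nu domain.
speakers = [
    [unwarp(warp(f) + shift) for f in reference_formants]
    for shift in speaker_shifts
]

if __name__ == "__main__":
    for shift, formants in zip(speaker_shifts, speakers):
        # In hertz the ratio to the reference changes from formant to formant...
        ratios = [f / f0 for f, f0 in zip(formants, reference_formants)]
        # ...while in the nu domain every formant moves by the same amount.
        offsets = [warp(f) - warp(f0) for f, f0 in zip(formants, reference_formants)]
        print(f"shift = {shift:.1f}")
        print("  Hz ratios :", [round(r, 4) for r in ratios])
        print("  nu offsets:", [round(o, 4) for o in offsets])
```

The speakers here are constructed from the warp itself, so the constant offsets are guaranteed by design; the point is only to show what "identical except for a translation factor" means, not to reproduce the paper's estimation of the warp from data.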