Auditory Representation Effective for Estimating Vocal Tract Information

Toshio Irino, Shintaro Doan

Abstract

We can estimate the size of speakers from their speech sounds alone. To explain this observation, we previously proposed an auditory computational theory, the Stabilised Wavelet-Mellin Transform (SWMT), which segregates information about vocal tract size, vocal tract shape, and glottal vibration. It has been shown that the auditory representation, or excitation pattern (EP), combined with a weighting function based on the SWMT, termed the "SSI weight," can account for the psychometric functions of size perception. In this study, we investigated whether the EP with the SSI weight can accurately estimate vocal tract lengths (VTLs) measured by magnetic resonance imaging (MRI) in male and female subjects. The SSI weight significantly improved VTL estimation. Furthermore, the estimation errors for the EP with the SSI weight were significantly smaller than those for commonly used spectra derived from the Fourier transform, Mel filterbank, and WORLD vocoder. The SSI weight can also be easily introduced into these spectra to improve their performance.

Paper Structure (22 sections, 4 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Excitation patterns (EPs, i.e., GCFB outputs) of the vowel 'a' for a male (a) and a female (b) speaker, and their cross-correlation function (c). In (a) and (b), the horizontal axis is the GCFB channel number and the vertical axis is the EP level. In (c), the horizontal axis is the shift in channels and the vertical axis is the cross-correlation value (arbitrary units). Blue solid line: EP; black dotted line: SSI weight (see Section \ref{sec:F0adaptiveWeightingFunc}); red dashed line: SSI-weighted EP. The circle (o) and asterisk (*) in (c) mark the peaks of the cross-correlations of the EPs and the SSI-weighted EPs, respectively.
  • Figure 2: Size-Shape Image (SSI) and SSI weight (adapted from irino2017auditory). (a) SSI of the vowel 'o' (irino2002segregating). The horizontal axis is the product of the time interval and the peak frequency of the auditory filter. The vertical axis is the peak frequency of the auditory filter, equally spaced on the $\mathrm{ERB_N}$-number axis. (b) Weighting function based on the active region of the SSI (see also Fig. \ref{fig:SWMT_SWT_SSIweight} in Appendix A).
  • Figure 3: Scatter plots of the estimated VTL ratio against the VTL measured from MRI data. Left: $Ep$; right: $Ep_{SSI}$ with $h_{max} = 3.5$. Points are color-coded by vowel. Dashed line: regression line obtained using the data for all vowels and speakers. Dotted line: 1:1 identity line.
  • Figure 4: Correlation coefficients between measured and estimated VTLs for various $h_{max}$ values. Lines: coefficients for the individual vowels ('a', 'i', 'u', 'e', 'o') and for all vowels combined ("All"). Bar: mean value across the five vowels.
  • Figure 5: Correlation coefficients between the measured VTLs and the VTLs estimated from the log- and power-compressed spectra. Left panel: original compressed spectra; right panel: spectra with the SSI weight applied. Lines and bars are as in Fig. \ref{fig:EffectSSIhmax}.
  • ...and 3 more figures
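
The comparison described in the Figure 1 caption (cross-correlating two EPs along the channel axis and reading off the shift at the peak) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `peak_shift` and its `weight_a`/`weight_b` arguments are hypothetical, the EPs are synthetic bumps rather than GCFB outputs, and the weights stand in for the SSI weight, whose definition is not given in this excerpt. Converting the channel shift into a VTL ratio depends on the filterbank's channel spacing and is not attempted here.

```python
import numpy as np


def peak_shift(ep_a, ep_b, weight_a=None, weight_b=None):
    """Return the channel shift at which the cross-correlation of two EPs peaks.

    ep_a, ep_b : 1-D arrays of EP levels, one value per filterbank channel.
    weight_a, weight_b : optional per-channel weights (hypothetical stand-ins
        for an SSI-style weight); each EP is multiplied by its weight first.
    A positive result k means ep_a looks like ep_b moved up by k channels.
    """
    a = np.asarray(ep_a, dtype=float)
    b = np.asarray(ep_b, dtype=float)
    if weight_a is not None:
        a = a * np.asarray(weight_a, dtype=float)
    if weight_b is not None:
        b = b * np.asarray(weight_b, dtype=float)

    # Full cross-correlation: entry i corresponds to lag i - (len(b) - 1).
    xcorr = np.correlate(a, b, mode="full")
    lags = np.arange(-(len(b) - 1), len(a))
    return int(lags[np.argmax(xcorr)])


# Synthetic example (not real speech data): two identical bumps whose
# centres differ by a few channels.
channels = np.arange(100)
ep_low = np.exp(-0.5 * ((channels - 40) / 5.0) ** 2)
ep_high = np.exp(-0.5 * ((channels - 46) / 5.0) ** 2)
shift = peak_shift(ep_high, ep_low)
```

Passing the same function weighted and unweighted EPs gives the two peak positions that the caption marks with an asterisk and a circle, respectively.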