The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling

Yuxin Lu, Yu-Ying Chuang, R. Harald Baayen

TL;DR

The study shows that semantics shapes the fine-grained pitch contours of disyllabic words in spontaneous Taiwan Mandarin, using GAM analyses and a theory-driven Discriminative Lexicon Model (DLM). Word identity, and especially sense type, robustly predict f0 contours, often more strongly than the canonical tone pattern. Building on this, the authors show that token-specific pitch contours can be predicted from contextualized embeddings via a simple linear mapping, with the strongest results when using abstracted centroid representations. Collectively, the findings challenge the traditional view that tone realization is largely determined by fixed tonal sequences, and they point to a learnable, semantics-driven mapping from meaning to phonetic realization in natural speech.

Abstract

A growing body of literature has demonstrated that semantics can co-determine fine phonetic detail. However, the complex interplay between phonetic realization and semantics remains understudied, particularly in pitch realization. The current study investigates the tonal realization of Mandarin disyllabic words with all 20 possible combinations of two tones, as found in a corpus of Taiwan Mandarin spontaneous speech. We made use of Generalized Additive Mixed Models (GAMs) to model f0 contours as a function of a series of predictors, including gender, tonal context, tone pattern, speech rate, word position, bigram probability, speaker and word. In the GAM analysis, word and sense emerged as crucial predictors of f0 contours, with effect sizes that exceed those of tone pattern. For each word token in our dataset, we then obtained a contextualized embedding by applying the GPT-2 large language model to the context of that token in the corpus. We show that the pitch contours of word tokens can be predicted to a considerable extent from these contextualized embeddings, which approximate token-specific meanings in contexts of use. The results of our corpus study show that meaning in context and phonetic realization are far more entangled than standard linguistic theory predicts.
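The idea of predicting pitch contours from contextualized embeddings through a linear mapping can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic rather than GPT-2 embeddings or corpus f0 measurements, the dimensions are arbitrary, and the ridge-regularized least-squares estimator and the correlation score are stand-ins, not necessarily the estimation or evaluation procedure used by the authors.

```python
import numpy as np

def fit_linear_mapping(S, C, ridge=1e-3):
    """Estimate G such that S @ G approximates C.

    S: (n_tokens, emb_dim) matrix of token embeddings (hypothetical input).
    C: (n_tokens, n_points) matrix of f0 contours sampled at fixed time points.
    ridge: small regularization constant (an assumption, for numerical stability).
    """
    d = S.shape[1]
    # Solve the regularized normal equations (S'S + ridge * I) G = S'C.
    return np.linalg.solve(S.T @ S + ridge * np.eye(d), S.T @ C)

def predict_contours(S, G):
    """Map embeddings to predicted contours with the linear mapping G."""
    return S @ G

def mean_row_correlation(C_true, C_pred):
    """Average per-token Pearson correlation between observed and predicted contours."""
    a = C_true - C_true.mean(axis=1, keepdims=True)
    b = C_pred - C_pred.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return float(np.mean(num / den))

# Synthetic stand-in data: 400 tokens, 16-dimensional embeddings, 10 f0 samples per token.
rng = np.random.default_rng(0)
n_tokens, emb_dim, n_points = 400, 16, 10
S = rng.normal(size=(n_tokens, emb_dim))
G_true = rng.normal(size=(emb_dim, n_points))
C = S @ G_true + 0.1 * rng.normal(size=(n_tokens, n_points))

# Fit on the first 300 tokens, evaluate on the held-out 100.
train, test = slice(0, 300), slice(300, None)
G_hat = fit_linear_mapping(S[train], C[train])
C_pred = predict_contours(S[test], G_hat)
score = mean_row_correlation(C[test], C_pred)
print(f"mean held-out contour correlation: {score:.3f}")
```

Because the synthetic contours are generated by a linear map plus a little noise, the held-out correlation here is high by construction. With real embeddings and real f0 contours the fit would depend on the data, and the score says nothing about the results reported in the paper.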

Paper Structure

This paper contains 19 sections, 11 figures, 15 tables.

Figures (11)

  • Figure 1: A selection of tokens in spoken Taiwan Mandarin. Left panel: six tokens representing six different word types, all sharing the tone pattern T4-T2 (a falling tone followed by a rising tone). The tokens are 後來 (hou4lai2, 'later'), 幸福 (xing4fu2, 'happiness'), 去年 (qu4nian2, 'last year'), 不能 (bu4neng2, 'cannot'), 自然 (zi4ran2, 'nature'), 問題 (wen4ti2, 'problem'). Right panel: four tokens representing the word type 幸福 (xing4fu2, 'happiness'). All f0 contours shown here are produced by the same speaker.
  • Figure 2: The increase in AIC scores when a predictor is withheld from the best-fit model. The AIC increase when word or tone_pattern is withheld is shown in red, and the increase for other predictors is shown in blue. Panels 1 to 4 represent four GAMs with tonal contexts 4.4, 3.4, 4.1, and 4.0, respectively.
  • Figure 3: Concurvity scores for selected terms in the four GAMs. The concurvity scores for word and tone_pattern are shown in red, and those for other predictors are shown in blue. From left to right, the panels present tonal contexts 3.4, 4.0, 4.1, and 4.4, respectively. Concurvity scores were calculated based on the best-fit GAMs with all predictors included.
  • Figure 4: The partial effect of the general smooth for normalized_time for female and male speakers, in different tonal contexts. The orange curves indicate the general contours for female speakers, and the blue curves indicate the general contours for male speakers. Vertical grey dashed lines indicate the average syllable boundary, and the horizontal grey dashed line represents the y=0 reference line.
  • Figure 5: The effect of tone pattern. The blue curves represent the partial effects of the factor smooth for tone_pattern, combined with the general smooth of normalized_t for female speakers, based on the best-fit models that include the word effect. There is one GAM for each tonal context, so each panel contains four blue curves, one per tonal context. The red curves show the mean f0 contour of a tone pattern, calculated by averaging the four f0 contours across the tonal contexts. Thus, the blue curves in each panel illustrate how the tonal context modulates the general curve shown in red. Vertical grey dashed lines indicate the average syllable boundary, and the horizontal grey dashed line represents the y=0 reference line.
  • ...and 6 more figures