KunquDB: An Attempt for Speaker Verification in the Chinese Opera Scenario

Huali Zhou, Yuke Lin, Dong Liu, Ming Li

TL;DR

KunquDB addresses data scarcity in Chinese opera research by introducing a relatively large-scale, annotated audio-visual dataset covering 339 performers and approximately 128 hours of content, with line-level annotations of character, speaker, gender, and vocal manner, plus preliminary transcripts. It supports research on Automatic Speaker Verification (ASV) and related tasks in opera, and introduces two domain adaptation methods, Domain Discrepancy Adversarial Learning (DDAL) and Batchwise Contrastive Siamese Training (BCST), to learn domain-invariant embeddings across stage speech and singing. Experimental results show that combining DDAL and BCST yields robust cross-domain verification performance, with benefits that differ across attention mechanisms. Overall, the work establishes a first benchmark for ASV in Chinese opera and provides both data and methods to advance research in opera analytics and synthesis.

Abstract

This work aims to promote Chinese opera research in both the musical and speech domains, with a primary focus on overcoming data limitations. We introduce KunquDB, a relatively large-scale, well-annotated audio-visual dataset comprising 339 speakers and 128 hours of content. Originating from the Kunqu Opera Art Canon (Kunqu yishu dadian), KunquDB is structured by dialogue lines and provides explicit annotations of character names, speaker names, gender, and vocal manner, along with preliminary text transcriptions. KunquDB provides a versatile foundation for role-centric acoustic studies and for speech-related research, including Automatic Speaker Verification (ASV). Beyond enriching opera research, this dataset bridges the gap between artistic expression and technological innovation. As a first exploration of ASV in Chinese opera, we construct four test trials that account for two distinct vocal manners in opera voices: stage speech (ST) and singing (S). Domain adaptation methods effectively mitigate the domain mismatch induced by these vocal manner variations, although there is still room for improvement on this benchmark.
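
To make the idea of manner-dependent test trials concrete, the following is a minimal sketch, not the authors' protocol. It assumes each utterance is a hypothetical `(speaker, manner, embedding)` tuple, that the conditions are ST-ST, S-S, ST-S, and a pooled ALL set (this section does not specify how the four trials are composed), that trials are scored with cosine similarity, and that they are summarized by the equal error rate (EER), a metric commonly used in ASV but not named here.

```python
from itertools import combinations
from math import sqrt


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors given as sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_trials(utterances):
    """Pair every two utterances and group the pairs by vocal manner.

    utterances: list of (speaker, manner, embedding), manner in {"ST", "S"}.
    Returns a dict mapping a condition name to a list of (score, is_target).
    The condition names are an assumed split, not taken from the paper.
    """
    trials = {"ST-ST": [], "S-S": [], "ST-S": [], "ALL": []}
    for (spk1, m1, e1), (spk2, m2, e2) in combinations(utterances, 2):
        condition = f"{m1}-{m2}" if m1 == m2 else "ST-S"
        trial = (cosine_score(e1, e2), spk1 == spk2)
        trials[condition].append(trial)
        trials["ALL"].append(trial)
    return trials


def equal_error_rate(trials):
    """Approximate EER: the point where false accepts and false rejects meet."""
    targets = [s for s, is_target in trials if is_target]
    nontargets = [s for s, is_target in trials if not is_target]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")
    best_gap, eer = None, None
    for threshold in sorted({s for s, _ in trials}):
        # Accept a trial when its score is at or above the threshold.
        far = sum(s >= threshold for s in nontargets) / len(nontargets)
        frr = sum(s < threshold for s in targets) / len(targets)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Reporting the cross-manner condition (ST-S) separately from the matched conditions is what exposes the domain mismatch: a model can separate speakers well within one vocal manner and still confuse them when enrollment and test utterances differ in manner.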

Paper Structure (30 sections, 4 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Mel spectrograms with overlaid pitch contours for singing (a), stage speech (b), and regular speech (c).
  • Figure 2: Left: Histogram of utterance lengths in the dataset. Right: Distribution of speaker role type information. The legend indicates the role type performed by speakers throughout the dataset. Dan for young female characters, LaoDan for old female characters, OtherFemale for additional female characters; XiaoSheng for young male characters, LaoSheng for old male characters, OtherMale for additional male characters; and MultiGender means speakers portraying characters of both genders.
  • Figure 3: Schematic of the DDAL framework. The pink dashed box outlines the identity embedding extractor; the green dashed box highlights the core components of the DDAL mechanism.
  • Figure 4: Overview of the BCST structure.
  • Figure 5: t-SNE visualization of speaker embeddings extracted by seven models (M0$\sim$M6). Each color denotes one speaker, with circular markers ($\medbullet$) representing stage speech utterances and pentagonal stars ($\bigstar$) denoting singing utterances.