Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach

Allahdadi Fatemeh, Mahdian Toroghi Rahil, Zareian Hassan

TL;DR

This paper introduces a novel, language-agnostic QbE-STD model that combines image processing techniques with a transformer architecture and can accurately count repetitions of the query term within the target audio.

Abstract

Query-by-example spoken term detection (QbE-STD) is typically constrained by transcribed data scarcity and language specificity. This paper introduces a novel, language-agnostic QbE-STD model leveraging image processing techniques and a transformer architecture. By employing a pre-trained XLSR-53 network for feature extraction and a Hough transform for detection, our model effectively searches for user-defined spoken terms within any audio file. Experimental results across four languages demonstrate significant performance gains (19-54%) over a CNN-based baseline. The model is faster than DTW, although its accuracy remains lower than that of DTW. Notably, our model can also accurately count repetitions of the query term within the target audio.
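
To make the detection idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes frame-level features are already available (random arrays stand in for XLSR-53 outputs here), builds a cosine similarity matrix between query and audio frames, and counts occurrences with a simplified Hough-style vote restricted to 45-degree diagonals rather than a full Hough transform. The function names, the similarity threshold, and the vote ratio are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(query, audio):
    """Cosine similarity between every query frame and every audio frame.

    query: (m, d) array, audio: (n, d) array -> (m, n) matrix.
    """
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    a = audio / (np.linalg.norm(audio, axis=1, keepdims=True) + 1e-8)
    return q @ a.T


def count_diagonal_matches(sim, threshold=0.9, min_vote_ratio=0.8):
    """Count diagonal line segments in a similarity matrix.

    Simplified Hough-style voting (assumption: an occurrence of the query
    appears as a slope-1 line). Each cell above `threshold` votes for the
    line offset j - i; offsets collecting at least `min_vote_ratio * m`
    votes are peaks, and adjacent peaks are merged into one detection.
    """
    m, n = sim.shape
    rows, cols = np.nonzero(sim >= threshold)
    offsets = cols - rows + (m - 1)  # shift so all indices are >= 0
    votes = np.bincount(offsets, minlength=m + n - 1)
    peaks = votes >= min_vote_ratio * m
    # A detection starts wherever a peak is not preceded by another peak.
    starts = peaks & ~np.concatenate(([False], peaks[:-1]))
    return int(starts.sum())


# Toy data: random "features" with the query planted twice in the audio.
rng = np.random.default_rng(0)
query = rng.normal(size=(10, 32))
audio = rng.normal(size=(100, 32))
audio[20:30] = query
audio[60:70] = query

sim = cosine_similarity_matrix(query, audio)
print("detected occurrences:", count_diagonal_matches(sim))
```

Fixing the slope keeps the accumulator one-dimensional, which is enough for exact copies of the query. Real speech varies in speaking rate, so a practical detector would need to vote over a range of slopes as well.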

Paper Structure (15 sections, 3 equations, 10 figures, 3 tables, 2 algorithms)

Figures (10)

  • Figure 1: Overview of the Query-by-Example Spoken Term Detection (QbE-STD)
  • Figure 2: Block diagram of the proposed QbE-STD method
  • Figure 3: The XLSR approach. A shared quantization module over feature encoder representations produces multilingual quantized speech units whose embeddings are then used as targets for a transformer trained by contrastive learning. The model creates bridges across languages.
  • Figure 4: Similarity matrix images for class 0 and class 1
  • Figure 5: Right: confusion matrix of model performance on the Farsi1 dataset. Left: confusion matrix of model performance on the multilingual dataset.
  • ...and 5 more figures