Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach

Allahdadi Fatemeh; Mahdian Toroghi Rahil; Zareian Hassan

TL;DR

This paper introduces a novel, language-agnostic QbE-STD model that combines image processing techniques with a transformer architecture and can accurately count repetitions of the query term within the target audio.

Abstract

Query-by-example spoken term detection (QbE-STD) is typically constrained by the scarcity of transcribed data and by language specificity. This paper introduces a novel, language-agnostic QbE-STD model that combines image processing techniques with a transformer architecture. By employing a pre-trained XLSR-53 network for feature extraction and a Hough transform for detection, the model searches for user-defined spoken terms within any audio file. Experimental results across four languages show significant performance gains (19-54%) over a CNN-based baseline. Compared with DTW, the model is faster, but its accuracy remains lower. Notably, the model can also accurately count repetitions of the query term within the target audio.
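
The pipeline named in the abstract (frame-level features, a distance matrix, line detection) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: random vectors stand in for XLSR-53 features, cosine distance is one plausible choice of distance, and a simple count along slope-1 diagonals stands in for the Hough transform (it behaves like a Hough accumulator restricted to a single line angle). The function names and the threshold are hypothetical and do not come from the paper.

```python
import numpy as np


def cosine_distance_matrix(query, audio):
    """Return D with D[i, j] = 1 - cos(query[i], audio[j])."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    return 1.0 - q @ a.T


def diagonal_votes(dist, threshold=0.1):
    """Count low-distance cells on each slope-1 diagonal of the matrix.

    Simplified stand-in for a Hough accumulator: only one line angle
    is considered, so each diagonal offset is one accumulator bin.
    """
    mask = dist < threshold
    n_q, n_a = mask.shape
    return {k: int(np.diagonal(mask, offset=k).sum())
            for k in range(-(n_q - 1), n_a)}


# Synthetic stand-in features (assumption: 16-dim frames, not XLSR-53 output).
rng = np.random.default_rng(0)
query = rng.normal(size=(5, 16))
audio = rng.normal(size=(40, 16))
audio[10:15] = query  # first planted repetition of the query
audio[25:30] = query  # second planted repetition

D = cosine_distance_matrix(query, audio)
votes = diagonal_votes(D)

# A diagonal whose vote count equals the query length is a full match.
hits = sorted(k for k, v in votes.items() if v == len(query))
print("match offsets:", hits, "repetitions:", len(hits))
```

In this toy setting each planted copy of the query shows up as one fully voted diagonal, so counting such diagonals counts the repetitions. Real speech would need tolerance to tempo variation (lines with other slopes), which is what a full Hough transform over several angles provides.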

Paper Structure (15 sections, 3 equations, 10 figures, 3 tables, 2 algorithms)

Introduction
Methodology
Feature Extraction
Distance Matrix
Template Recognition
Experimental Setup and Results
Datasets
Baseline Models
Evaluation Metric
Experiments
Feature Extraction Block Performance
Distance Matrix Block Performance
Template Recognition Block Performance
QbE-STD Performance
Conclusions

Figures (10)

Figure 1: Overview of the Query-by-Example Spoken Term Detection (QbE-STD)
Figure 2: Block diagram of the proposed QbE-STD method
Figure 3: The XLSR approach. A shared quantization module over feature encoder representations produces multilingual quantized speech units whose embeddings are then used as targets for a transformer trained by contrastive learning. The model creates bridges across languages.
Figure 4: Similarity matrix images for class 0 and class 1
Figure 5: Right: confusion matrix of model performance on the Farsi1 dataset. Left: confusion matrix of model performance on the multilingual dataset.
...and 5 more figures
