Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition

Hao Yen, Pin-Jui Ku, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Yu Tsao

TL;DR

This work tackles low-resource spoken command recognition by reusing a pretrained acoustic model through adversarial reprogramming (AR). It introduces a similarity-based label mapping to better align source and target classes and combines AR with transfer learning to enhance adaptation when target data are scarce. Experiments on Lithuanian, Arabic, and dysarthric Mandarin datasets show that AR combined with TL (and occasional data augmentation) yields substantial gains over baselines and can surpass state-of-the-art results despite limited target data. The findings demonstrate AR's viability as a flexible front-end adaptation technique that complements TL and augmentation for practical SCR deployment.

Abstract

In this study, we propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR) and build an AR-SCR system. The AR procedure modifies the acoustic signals from the target domain so that a pretrained SCR model from the source domain can be repurposed for the target task. To resolve label mismatches between the source and target domains, and to further improve the stability of AR, we propose a novel similarity-based label mapping technique to align classes. In addition, transfer learning (TL) is combined with the original AR process to improve the model's adaptation capability. We evaluate the proposed AR-SCR system on three low-resource SCR datasets: Arabic, Lithuanian, and dysarthric Mandarin speech. Experimental results show that, with an acoustic model (AM) pretrained on a large-scale English dataset, the proposed AR-SCR system outperforms the current state-of-the-art results on the Arabic and Lithuanian speech command datasets using only a limited amount of training data.
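As a rough illustration of what "modifying the acoustic signals" can look like, the sketch below zero-pads a target-domain waveform to the input length expected by the source model and adds a perturbation vector. This additive form is one common way to write input reprogramming and is an assumption here; the abstract does not specify the exact transformation. The names `reprogram`, `delta`, and `source_len` are hypothetical, and in a real system `delta` would be learned by gradient descent while the pretrained model stays fixed.

```python
def reprogram(signal, delta, source_len):
    """Zero-pad a target-domain waveform to the source input length,
    then add a perturbation of that same length (assumed additive form)."""
    if len(signal) > source_len:
        raise ValueError("target signal is longer than the source input length")
    if len(delta) != source_len:
        raise ValueError("delta must match the source input length")
    # Pad the shorter target signal with zeros on the right.
    padded = list(signal) + [0.0] * (source_len - len(signal))
    # Element-wise addition of the (normally trainable) perturbation.
    return [x + d for x, d in zip(padded, delta)]


# Toy values for illustration only; real inputs would be audio samples.
signal = [0.1, -0.2, 0.3]
delta = [0.01] * 5
reprogrammed = reprogram(signal, delta, source_len=5)
```

The output of `reprogram` is what would be fed to the frozen pretrained acoustic model in place of the raw target signal.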

Paper Structure

This paper contains 12 sections, 2 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Illustration of the proposed AR-SCR system. The acoustic signal of a Lithuanian command ("ne") is reprogrammed to English commands ("nine" and "no") and mapped to its final prediction by a pretrained English acoustic model.
  • Figure 2: Frameworks studied in this work. The "AM" block refers to the acoustic model. In (a), the baseline system is trained from scratch on the target domain data. In (b), the AM is pretrained on the source domain and then fine-tuned on the target domain data. In (c), the AM is pretrained on the source domain and then frozen; an adversarial reprogramming (AR) layer is placed before the pretrained AM to modify the input signals. In (d), AR and TL are combined: the AR layer is trained and the AM is fine-tuned simultaneously.
  • Figure 3: PCA plots of the average representations of several source-target pairs for (a) the English-Lithuanian and (b) the English-Arabic datasets. Each target class (star) is mapped to the two or three source classes (circles) with the highest cosine similarity to it; mapped classes are marked with the same color.
  • Figure 4: The best-1 accuracy values of the baseline, the state-of-the-art wav2vec model [Kolesau2020], and the proposed AR+TL system on the Lithuanian speech command dataset. We follow the evaluation metric used in [Kolesau2020] and report only the best-1 test accuracy values.
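
The mapping described in the captions for Figures 1 and 3 can be sketched as follows: each target class is assigned the k source classes whose average representations have the highest cosine similarity to it, and a target prediction is obtained by pooling the source-class probabilities of the mapped classes. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the 2-D toy vectors, the target class "taip", and the choice to average the mapped probabilities are all hypothetical; only the "ne" to "nine"/"no" pairing comes from Figure 1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_mapping(source_reps, target_reps, k=2):
    """Map each target class to its k most similar source classes."""
    mapping = {}
    for target, t_vec in target_reps.items():
        ranked = sorted(
            source_reps,
            key=lambda s: cosine(source_reps[s], t_vec),
            reverse=True,
        )
        mapping[target] = ranked[:k]
    return mapping


def predict_target(source_probs, mapping):
    """Average the source probabilities mapped to each target class
    (an assumed pooling rule) and return the best-scoring target."""
    scores = {
        target: sum(source_probs[s] for s in sources) / len(sources)
        for target, sources in mapping.items()
    }
    return max(scores, key=scores.get)


# Toy average class representations (made-up 2-D vectors).
source_reps = {
    "nine": [1.0, 0.2],
    "no": [0.9, 0.4],
    "yes": [-0.8, 0.6],
    "up": [-0.6, 0.9],
}
target_reps = {"ne": [1.0, 0.3], "taip": [-0.7, 0.7]}

mapping = build_mapping(source_reps, target_reps, k=2)

# Made-up output of the pretrained source model for one reprogrammed input.
source_probs = {"nine": 0.4, "no": 0.3, "yes": 0.2, "up": 0.1}
prediction = predict_target(source_probs, mapping)
```

With these toy values, "ne" is mapped to "nine" and "no", mirroring the example in Figure 1. In practice the class representations would be averaged model features rather than hand-picked vectors.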