Sequence-to-sequence models in peer-to-peer learning: A practical application

Robert Šajina, Ivo Ipšić

TL;DR

This work investigates the viability of LSTM-based Seq2Seq models for automatic speech recognition in peer-to-peer learning, comparing decentralized exchanges of localized models against centralized training on pooled data. The authors implement a scaled-down Deep Speech 2 variant and evaluate two P2P aggregation schemes, Pull-gossip and P2P-BN, on UserLibri and LJ Speech datasets, using greedy decoding and CTC loss. Centralized training achieves notably lower Word Error Rates (84% on UserLibri and 38% on LJ Speech) than peer-to-peer approaches (ranging roughly 87%–92% for UserLibri and 52%–56% for LJ Speech), illustrating the feasibility of decentralized Seq2Seq ASR while highlighting data availability as a key factor. The study concludes that Seq2Seq models can operate in decentralized environments, but slower convergence and higher data requirements in peer-to-peer settings motivate future work on convergence acceleration and robustness when local data are limited.
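To make the "greedy decoding" step concrete, the following is a minimal sketch of greedy CTC decoding in plain Python: take the highest-scoring symbol in each frame, collapse consecutive repeats, and drop blanks. The function name, the toy alphabet, and the choice of index 0 for the CTC blank are assumptions made for illustration; the summary above does not specify the authors' alphabet or implementation.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    alphabet: Sequence[str],
    blank: int = 0,  # assumption: index 0 is the CTC blank symbol
) -> str:
    """Greedy (best-path) CTC decoding over per-frame scores."""
    # Step 1: pick the highest-scoring symbol index in every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]

    # Step 2: collapse consecutive repeats, then drop blanks.
    decoded: List[str] = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Toy example (hypothetical scores): the best path is a, a, blank, a, b, b.
toy_alphabet = ["_", "a", "b"]  # "_" stands in for the blank
toy_scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
]
toy_text = ctc_greedy_decode(toy_scores, toy_alphabet)
```

In the toy example, the blank between the second and third "a" frames is what allows the repeated letter to survive the collapse step.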

Abstract

This paper explores the applicability of sequence-to-sequence (Seq2Seq) models based on LSTM units for the Automatic Speech Recognition (ASR) task within peer-to-peer learning environments. Leveraging two distinct peer-to-peer learning methods, the study simulates the learning process of agents and evaluates their performance on the ASR task using two different ASR datasets. In a centralized training setting, utilizing a scaled-down variant of the Deep Speech 2 model, a single model achieved a Word Error Rate (WER) of 84% when trained on the UserLibri dataset, and 38% when trained on the LJ Speech dataset. Conversely, in a peer-to-peer learning scenario involving 55 agents, the WER ranged from 87% to 92% for the UserLibri dataset, and from 52% to 56% for the LJ Speech dataset. The findings demonstrate the feasibility of employing Seq2Seq models in decentralized settings, albeit with slightly higher WER compared to centralized training methods.
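Since every result above is reported as a Word Error Rate, a short sketch of how WER is conventionally computed may help: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the hypothesis, divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; the abstract does not state which text normalization the authors apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()  # assumption: plain whitespace tokenization
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row[j] = min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            )
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One missing word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.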

Paper Structure (10 sections, 3 figures)

Figures (3)

  • Figure 1: Data processing and training scheme.
  • Figure 2: An example of a raw and processed audio-text pair from the UserLibri dataset.
  • Figure 3: Example of a short audio-text pair with added audio and text padding, compared to a long audio-text pair.