Sequence-to-sequence models in peer-to-peer learning: A practical application
Robert Šajina, Ivo Ipšić
TL;DR
This work investigates the viability of LSTM-based Seq2Seq models for automatic speech recognition in peer-to-peer learning, comparing decentralized exchanges of localized models against centralized training on pooled data. The authors implement a scaled-down Deep Speech 2 variant and evaluate two P2P aggregation schemes, Pull-gossip and P2P-BN, on the UserLibri and LJ Speech datasets, using greedy decoding and CTC loss. Centralized training achieves notably lower Word Error Rates (84\% on UserLibri and 38\% on LJ Speech) than the peer-to-peer approaches (roughly 87\% to 92\% on UserLibri and 52\% to 56\% on LJ Speech), illustrating the feasibility of decentralized Seq2Seq ASR while highlighting data availability as a key factor. The study concludes that Seq2Seq models can operate in decentralized environments, but slower convergence and higher data requirements in peer-to-peer settings motivate future work on convergence acceleration and robustness when local data are limited.
Abstract
This paper explores the applicability of sequence-to-sequence (Seq2Seq) models based on LSTM units to the Automatic Speech Recognition (ASR) task within peer-to-peer learning environments. Leveraging two distinct peer-to-peer learning methods, the study simulates the learning process of agents and evaluates their performance on the ASR task using two different ASR datasets. In a centralized training setting, a single scaled-down variant of the Deep Speech 2 model achieved a Word Error Rate (WER) of 84\% when trained on the UserLibri dataset and 38\% when trained on the LJ Speech dataset. In contrast, in a peer-to-peer learning scenario involving 55 agents, the WER ranged from 87\% to 92\% for the UserLibri dataset and from 52\% to 56\% for the LJ Speech dataset. The findings demonstrate the feasibility of employing Seq2Seq models in decentralized settings, albeit with slightly higher WERs than centralized training.
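For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the text does not specify the authors' exact evaluation code or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which may differ from the paper's setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the").
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the ratio can exceed 1.0 when the hypothesis is much longer than the reference, so a WER above 100\% is possible in principle.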
