PI-Whisper: Designing an Adaptive and Incremental Automatic Speech Recognition System for Edge Devices

Amir Nassereldine, Dancheng Liu, Chenhui Xu, Ruiyang Qin, Yiyu Shi, Jinjun Xiong

TL;DR

PI-Whisper is an adaptive edge ASR framework that incrementally adapts to speaker characteristics by learning multiple LoRA profiles and merging them at inference time. It employs a lightweight speaker-characteristic classifier to identify attributes and loads the corresponding LoRA profiles from dedicated libraries, enabling non-intrusive personalization without full model retraining. The approach achieves state-of-the-art WER on the evaluated datasets, with up to 13.7% relative improvement and modest overhead, and demonstrates zero-shot transfer and fairness improvements across diverse speaker groups. The work shows practical potential for personalized, privacy-preserving ASR on edge devices.
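
The select-then-merge idea described above can be illustrated with a small sketch. This is not the authors' implementation: the names `LoraProfile`, `select_profiles`, and `merge_profiles` are hypothetical, the matrices are toy-sized, and uniform averaging of the low-rank updates is an assumed merging rule chosen only for illustration.

```python
# Illustrative sketch only. All names and the uniform-averaging rule are
# assumptions, not taken from the PI-Whisper paper.
from dataclasses import dataclass

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product (kept dependency-free for the sketch)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


@dataclass
class LoraProfile:
    name: str
    B: Matrix  # shape: d_out x r
    A: Matrix  # shape: r x d_in

    def delta(self) -> Matrix:
        """Low-rank weight update B @ A contributed by this profile."""
        return matmul(self.B, self.A)


def select_profiles(libraries: dict, predicted: dict) -> list:
    """Look up one profile per characteristic (e.g. accent, gender).

    `predicted` stands in for the classifier output; characteristics with
    no matching profile in the library are skipped.
    """
    chosen = []
    for characteristic, label in predicted.items():
        profile = libraries.get(characteristic, {}).get(label)
        if profile is not None:
            chosen.append(profile)
    return chosen


def merge_profiles(base: Matrix, profiles: list) -> Matrix:
    """Add the uniformly averaged LoRA updates to a copy of the base weights."""
    merged = [row[:] for row in base]
    if not profiles:
        return merged
    weight = 1.0 / len(profiles)
    for profile in profiles:
        for i, row in enumerate(profile.delta()):
            for j, value in enumerate(row):
                merged[i][j] += weight * value
    return merged


# Toy usage: two libraries, one rank-1 profile each, a 2x2 base weight.
libraries = {
    "accent": {"indian": LoraProfile("accent/indian", B=[[1.0], [0.0]], A=[[2.0, 0.0]])},
    "gender": {"female": LoraProfile("gender/female", B=[[0.0], [1.0]], A=[[0.0, 4.0]])},
}
predicted = {"accent": "indian", "gender": "female", "age": "thirties"}
chosen = select_profiles(libraries, predicted)
merged = merge_profiles([[0.0, 0.0], [0.0, 0.0]], chosen)
```

Because each profile is stored separately and only combined at inference, adding a new characteristic in this sketch means adding one more library entry; nothing already stored needs to be retrained.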

Abstract

Edge-based automatic speech recognition (ASR) technologies are increasingly prevalent in the development of intelligent and personalized assistants. However, resource-constrained ASR models face significant challenges in adaptivity, incrementality, and inclusivity when faced with a diverse population. To tackle these challenges, we propose PI-Whisper, a novel ASR system that adaptively enhances recognition capabilities by identifying speakers' characteristics in real time. In this work, we show how the design of PI-Whisper allows for incremental adaptation to new characteristics without the need for repetitive retraining, enhances recognition capabilities, and improves equity and fairness across diverse speaker groups. PI-Whisper demonstrates these advantages by achieving state-of-the-art accuracy, reducing the word error rate (WER) by up to 13.7% relative to baselines while scaling linearly with computing resources.
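
The 13.7% figure is a relative reduction, not an absolute drop in WER points. The short sketch below shows the distinction. The helper name `relative_wer_reduction` and the WER values are hypothetical, chosen only to make the arithmetic visible; they are not results from the paper.

```python
# Illustrative arithmetic only: the WER values below are made up and are
# not results reported in the paper.

def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: a baseline WER of 20.00% lowered to 17.26%.
reduction = relative_wer_reduction(20.0, 17.26)
print(f"{reduction:.1%}")  # prints 13.7%
```

In this hypothetical case the absolute change is 2.74 WER points, while the relative reduction is 13.7%.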

Paper Structure

This paper contains 22 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: PI-Whisper leverages multiple LoRA profile libraries and dynamic merging of LoRA profiles to adjust ASR towards the speaker's characteristics. When the speaker's characteristics are not known, PI-Whisper employs a multi-head classifier to infer them from audio samples.
  • Figure 2: Comparison between PI-Whisper (in orange) and baseline models (in blue) on the CommonVoice dataset. All baseline results are obtained from prabhu2023accented. Ours (Ac) is PI-Whisper with the accent profile library only, while Ours (All) is PI-Whisper with all three profile libraries, namely accent, gender, and age.
  • Figure 3: Impact of LoRA profiles on inference time and Word Error Rate (WER) for Raspberry Pi and Jetson devices, highlighting baseline performance, overhead contributions, and WER trends across Known and Inferred settings.
  • Figure 4: Impact of LoRA profiles on memory usage with dynamic profile loading, illustrating the breakdown of baseline memory, profile overhead, and classifier overhead as the number of profiles increases.