Bridging the Reality Gap: Efficient Adaptation of ASR systems for Challenging Low-Resource Domains

Darshil Chauhan, Adityasinh Solanki, Vansh Patel, Kanav Kapoor, Ritvik Jain, Aditya Bansal, Pratik Narang, Dhruv Kumar

TL;DR

The paper tackles the reality gap hindering ASR deployment in low-resource, privacy-constrained domains by proposing an on-device continual learning framework that combines Low-Rank Adaptation (LoRA) with multi-domain experience replay. Because adaptation runs directly on edge devices, patient audio never has to leave the device, and the approach mitigates catastrophic forgetting while improving target-domain WER on Gram Vaani by 17.1% relative. Comparing three strategies (naive fine-tuning, single-domain replay, and multi-domain replay), the authors show that a multi-domain replay buffer yields the best balance between specialization and general linguistic stability, achieving a final WER of 33.94% on the target domain and a 47% reduction in forgetting on the general domain relative to naive adaptation. The work demonstrates a viable path for real-world, self-improving ASR systems in resource-limited environments like rural healthcare, where data residency and limited compute are critical constraints.
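To make the idea of multi-domain experience replay concrete, here is a minimal sketch of how a training batch might mix fresh target-domain utterances with replayed utterances from other domains. It is not the authors' implementation: the function name `mixed_batch`, the `replay_fraction` value, the round-robin domain selection, and the `other_domain` pool are all assumptions made purely for illustration.

```python
import random


def mixed_batch(target_pool, replay_pools, batch_size=8, replay_fraction=0.25, seed=0):
    """Build one batch mixing target-domain and replayed utterances.

    target_pool:  list of utterance ids from the incoming target stream.
    replay_pools: dict mapping a domain name to a list of stored utterance ids.
    replay_fraction is a hypothetical knob, not a value taken from the paper.
    """
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_fraction)
    n_target = batch_size - n_replay

    # New data from the domain the model is adapting to.
    batch = [("target", rng.choice(target_pool)) for _ in range(n_target)]

    # Replayed data, spread round-robin across the stored domains.
    domains = sorted(replay_pools)
    for i in range(n_replay):
        domain = domains[i % len(domains)]
        batch.append((domain, rng.choice(replay_pools[domain])))

    rng.shuffle(batch)
    return batch


batch = mixed_batch(
    target_pool=["gv_001", "gv_002", "gv_003"],
    replay_pools={"kathbath": ["kb_001", "kb_002"], "other_domain": ["od_001"]},
)
print(batch)
```

With the defaults above, six of the eight items come from the target stream and one from each replay pool. In a single-domain replay setup, `replay_pools` would simply hold one entry.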

Abstract

Automatic Speech Recognition (ASR) holds immense potential to streamline clinical documentation, such as digitizing handwritten prescriptions and reports, thereby increasing patient throughput and reducing costs in resource-constrained sectors like rural healthcare. However, realizing this utility is currently obstructed by significant technical barriers: strict data privacy constraints, limited computational resources, and severe acoustic domain shifts. We quantify this gap by showing that a robust multilingual model (IndicWav2Vec) degrades to a stark 40.94% Word Error Rate (WER) when deployed on real-world clinical audio (Gram Vaani), rendering it unusable for practical applications. To address these challenges and bring ASR closer to deployment, we propose an efficient, privacy-preserving adaptation framework. We employ Low-Rank Adaptation (LoRA) to enable continual learning from incoming data streams directly on edge devices, ensuring patient data confidentiality. Our strategy yields a 17.1% relative improvement in WER on the target domain. Furthermore, by integrating multi-domain experience replay, we reduce catastrophic forgetting by 47% compared to naive adaptation. These results demonstrate a viable pathway for building reliable, self-improving ASR systems that can operate effectively within the constraints of high-impact real-world environments.
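As a quick consistency check on the reported numbers, the 17.1% relative improvement follows from the 40.94% baseline WER quoted in the abstract and the 33.94% final WER quoted in the TL;DR. The helper name `relative_wer_reduction` is ours, and pairing these two figures is our reading of the summary rather than a calculation shown in the source.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction in percent: (baseline - adapted) / baseline * 100."""
    return (baseline_wer - adapted_wer) / baseline_wer * 100.0


baseline = 40.94  # IndicWav2Vec on Gram Vaani before adaptation (from the abstract)
adapted = 33.94   # final target-domain WER (from the TL;DR)

print(round(relative_wer_reduction(baseline, adapted), 1))  # 17.1
```

The absolute gain is 7.00 WER points; dividing by the baseline gives the relative figure.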

Paper Structure

This paper contains 33 sections, 4 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Progression of (a) Word Error Rate (WER) and (b) Character Error Rate (CER) on the Gram Vaani target domain during continual adaptation. The Multi-Domain Replay strategy (V3.1) demonstrates superior stability and final performance compared to naive fine-tuning (V1.1) and single-domain replay (V2.1).
  • Figure 2: Catastrophic forgetting analysis on the Kathbath general domain. (a) WER and (b) CER on the held-out Kathbath test set as the model adapts to Gram Vaani. The shaded regions represent the performance degradation relative to the baseline. V3.1 (green) significantly reduces forgetting compared to V1.1 (red) and V2.1 (orange).
  • Figure 3: Training progression for V1: Naive Fine-tuning (Conservative Warmup).
  • Figure 4: Training progression for V2: Single-Domain Replay (Conservative Warmup).
  • Figure 5: Training progression for V3: Multi-Domain Replay (Conservative Warmup).
  • ...and 5 more figures