Knowledge-Decoupled Functionally Invariant Path with Synthetic Personal Data for Personalized ASR

Yue Gu, Zhihao Du, Ying Shi, Jiqing Han, Yongjun He

TL;DR

This work tackles personalized ASR with synthetic personal data by introducing Knowledge-Decoupled FIP (KDFIP), which stores generic and personalized knowledge in separate modules, updates them along functionally invariant paths, and fuses their outputs through a gating mechanism. Synthetic data is generated via zero-shot TTS (CosyVoice 2.0) and incorporated through a staged training process that uses adapters for personalization and a sequential update strategy to maintain generalization. With synthetic data, the framework achieves a 29.38% relative CER reduction on target speakers while maintaining comparable performance on non-target speech, and ablation studies highlight the importance of data origin and data duration in driving gains. The approach offers a practical path to scalable, privacy-conscious personalization that leverages synthetic data without catastrophic forgetting of generic knowledge.
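To make the headline metric concrete, the sketch below shows how a character error rate (CER) and a relative CER reduction are commonly computed. It is a minimal illustration, not the authors' evaluation code: the function names (`edit_distance`, `cer`, `relative_cer_reduction`) and the numeric values are hypothetical, and the paper may apply text normalization steps that are not shown here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """CER = character edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_cer_reduction(baseline_cer: float, adapted_cer: float) -> float:
    """Relative reduction in percent: 100 * (baseline - adapted) / baseline."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - adapted_cer) / baseline_cer


# Hypothetical CER values for illustration only; they are not taken from the paper.
example_reduction = relative_cer_reduction(baseline_cer=0.10, adapted_cer=0.075)
```

Under this definition, a relative reduction is measured against the baseline CER, so the same percentage can correspond to very different absolute improvements depending on how high the baseline is.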

Abstract

Fine-tuning generic ASR models with large-scale synthetic personal data can enhance the personalization of ASR models, but it introduces challenges in adapting to synthetic personal data without forgetting real knowledge, and in adapting to personal data without forgetting generic knowledge. Considering that the functionally invariant path (FIP) framework enables model adaptation while preserving prior knowledge, in this letter, we introduce FIP into synthetic-data-augmented personalized ASR models. However, the model still struggles to balance the learning of synthetic, personalized, and generic knowledge when applying FIP to train the model on all three types of data simultaneously. To decouple this learning process and further address the above two challenges, we integrate a gated parameter-isolation strategy into FIP and propose a knowledge-decoupled functionally invariant path (KDFIP) framework, which stores generic and personalized knowledge in separate modules and applies FIP to them sequentially. Specifically, KDFIP adapts the personalized module to synthetic and real personal data and the generic module to generic data. Both modules are updated along personalization-invariant paths, and their outputs are dynamically fused through a gating mechanism. With augmented synthetic data, KDFIP achieves a 29.38% relative character error rate reduction on target speakers and maintains comparable generalization performance to the unadapted ASR baseline.
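The abstract states that the outputs of the generic and personalized modules are dynamically fused through a gating mechanism, without giving the exact form. The following sketch shows one common way such a gate can work, assuming a per-element sigmoid gate that forms a convex combination of the two outputs. The function `gated_fusion`, its arguments, and the sigmoid form are assumptions made for illustration and may differ from the gate used in KDFIP.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(generic_out: List[float],
                 personal_out: List[float],
                 gate_logits: List[float]) -> List[float]:
    """Fuse two module outputs element-wise with a sigmoid gate.

    Assumed form (not confirmed by the source):
        fused = alpha * personal + (1 - alpha) * generic,
    where alpha = sigmoid(gate_logit).
    """
    if not (len(generic_out) == len(personal_out) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for g, p, z in zip(generic_out, personal_out, gate_logits):
        alpha = sigmoid(z)  # weight placed on the personalized module
        fused.append(alpha * p + (1.0 - alpha) * g)
    return fused


# A gate logit of 0 weights both modules equally.
balanced = gated_fusion([0.0, 2.0], [2.0, 4.0], [0.0, 0.0])
```

In this sketch a large positive gate logit lets the personalized module dominate (useful for target-speaker speech), while a large negative logit falls back to the generic module, which is one way a gate could help preserve performance on non-target speech.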

Paper Structure

This paper contains 11 sections, 11 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Schematic of KDFIP construction in the weight space $(\mathrm{w}_1, \mathrm{w}_n, \dots, \mathrm{w}_N)$ of ASR models for sequential training on personal and generic data, where the generic module $\mathbf{w}_\mathrm{b}$ and personalized module $\mathbf{w}_\mathrm{a}$ correspond to $(\mathrm{w}_1, \dots, \mathrm{w}_n)$ and $(\mathrm{w}_{n+1}, \dots, \mathrm{w}_N)$, respectively. The spherical and hemispherical shapes represent sets of model parameters at each stage. $\nabla L$ and $\mathrm{d}\boldsymbol{w}$ denote the gradient with respect to $L$ and the perturbation in the weights $\boldsymbol{w}$, respectively.
  • Figure 2: Ablation on the duration of synthetic personal data in "Stage 3" of KDFIP.
  • Figure 3: Interpolated hyperparameter tuning of KDFIP.