Spectral-Aware Low-Rank Adaptation for Speaker Verification

Zhe Li; Man-wai Mak; Mert Pilanci; Hung-yi Lee; Helen Meng

TL;DR

Conventional PEFT methods such as LoRA do not exploit the spectral structure of pre-trained weights. The paper introduces SpectralFT, a spectral-aware fine-tuning scheme that uses SVD to decompose pre-trained weight matrices into a principal subspace $\mathbf{W}_p$ and a minor subspace $\mathbf{W}_m$, keeps the pre-trained principal factors frozen, and applies trainable LoRA-style additive adapters $\Delta_U$ and $\Delta_V$ to the top singular vectors of $\mathbf{W}_p$. Experiments on VoxCeleb1 and CN-Celeb1, using HuBERT-Large or WavLM-Large as the pre-trained model and ECAPA-TDNN as the speaker encoder, show that SpectralFT outperforms Adapter, static prompt tuning, LoRA, and other baselines, especially when tuning $\mathbf{W}_q$ and $\mathbf{W}_k$ with a moderate rank $r$ and a moderate number $k$ of top components. The findings indicate that confining adaptation to the top spectral space preserves essential pre-trained knowledge while still allowing task-specific refinement, which improves speaker verification performance at modest computational overhead. This spectral-guided PEFT approach offers a practical path to efficient, high-capacity fine-tuning for speech applications and potentially beyond.

Abstract

Previous research has shown that the principal singular vectors of a pre-trained model's weight matrices capture critical knowledge. In contrast, those associated with small singular values may contain noise or less reliable information. As a result, the LoRA-based parameter-efficient fine-tuning (PEFT) approach, which does not constrain the use of the spectral space, may not be effective for tasks that demand high representation capacity. In this study, we enhance existing PEFT techniques by incorporating the spectral information of pre-trained weight matrices into the fine-tuning process. We investigate spectral adaptation strategies with a particular focus on the additive adjustment of top singular vectors. This is accomplished by applying singular value decomposition (SVD) to the pre-trained weight matrices and restricting the fine-tuning within the top spectral space. Extensive speaker verification experiments on VoxCeleb1 and CN-Celeb1 demonstrate enhanced tuning performance with the proposed approach. Code is released at https://github.com/lizhepolyu/SpectralFT.
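The restriction to the top spectral space described in the abstract can be written out as a short worked formulation. This is a sketch under assumptions rather than the paper's own equations: the symbols $\mathbf{W}'$, $\mathbf{U}_m$, $\mathbf{\Sigma}_m$, $\mathbf{V}_m$ and the exact placement of the adapters are notational choices made here, pieced together from the TL;DR and the Figure 1 caption.

$$
\begin{aligned}
\mathbf{W} &= \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top
 = \underbrace{\mathbf{U}_p \mathbf{\Sigma}_p \mathbf{V}_p^\top}_{\mathbf{W}_p}
 + \underbrace{\mathbf{U}_m \mathbf{\Sigma}_m \mathbf{V}_m^\top}_{\mathbf{W}_m}, \\
\mathbf{W}' &= (\mathbf{U}_p + \Delta_U)\,\mathbf{\Sigma}_p\,(\mathbf{V}_p + \Delta_V)^\top + \mathbf{W}_m, \\
\Delta_U &= \mathbf{B}_U \mathbf{A}_U, \qquad \Delta_V = \mathbf{B}_V \mathbf{A}_V .
\end{aligned}
$$

Here $\mathbf{U}_p$ and $\mathbf{V}_p$ hold the $k$ singular vectors with the largest singular values, and $\Delta_U$, $\Delta_V$ are rank-$r$ products. If $\mathbf{B}_U$ and $\mathbf{B}_V$ start at zero, as is customary for LoRA, then $\mathbf{W}' = \mathbf{W}$ at the beginning of fine-tuning.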

Paper Structure (14 sections, 3 equations, 2 figures, 4 tables)

Introduction
Methodology
Low-Rank Adaptation
Singular Value Decomposition
Spectral Fine-tuning
Computation Considerations
Experiments and Results
Implementation Details
Results and Analysis
Investigating Different Rank Settings
Analysis of Principal Columns
Analysis of the Effect of Singular Vectors
Analyze the Fine-tuning Positions
Conclusions

Figures (2)

Figure 1: The architecture of the proposed SpectralFT. The principal singular components $(\bm{U}_p, \bm{V}_p, \bm{\Sigma}_p)$ are retained to form a low-rank approximation of the original weight matrix $\bm{W}$, which is then fine-tuned using the principle of LoRA. During fine-tuning, only the low-rank matrices $\bm{B}_U$, $\bm{A}_U$, $\bm{B}_V$, and $\bm{A}_V$ are updated, while the principal matrices $\bm{U}_p$ and $\bm{V}_p$ remain frozen. For the operations and principles of the Transformer Encoder, Pre-trained Network, and Speaker Classifier, readers are referred to [li2024dual, li2024parameter].
Figure 2: Results on VoxCeleb1-O for different ranks, using WavLM-Large as the PTM.
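
The split between frozen and trainable matrices in the Figure 1 caption can be illustrated with a small NumPy sketch. It is not the released implementation. It assumes the adapted weight is formed as $(\bm{U}_p + \bm{B}_U\bm{A}_U)\,\bm{\Sigma}_p\,(\bm{V}_p + \bm{B}_V\bm{A}_V)^\top$ plus the untouched remainder of $\bm{W}$, with $\bm{B}_U$ and $\bm{B}_V$ initialised to zero as in LoRA. The function names, matrix sizes, and initialisation scale are hypothetical.

```python
import numpy as np

def spectral_split(W, k):
    """Return the top-k singular factors of W and the residual (minor) part."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_p, s_p, V_p = U[:, :k], s[:k], Vt[:k, :].T
    W_m = W - (U_p * s_p) @ V_p.T          # everything outside the top-k space
    return U_p, s_p, V_p, W_m

def init_adapters(d_out, d_in, k, r, rng):
    """Low-rank factors; B_* start at zero so the initial update is zero."""
    return {
        "B_U": np.zeros((d_out, r)),
        "A_U": rng.normal(scale=0.01, size=(r, k)),
        "B_V": np.zeros((d_in, r)),
        "A_V": rng.normal(scale=0.01, size=(r, k)),
    }

def adapted_weight(U_p, s_p, V_p, W_m, ad):
    """Assumed form: (U_p + B_U A_U) diag(s_p) (V_p + B_V A_V)^T + W_m."""
    dU = ad["B_U"] @ ad["A_U"]             # shape (d_out, k)
    dV = ad["B_V"] @ ad["A_V"]             # shape (d_in, k)
    return ((U_p + dU) * s_p) @ (V_p + dV).T + W_m

rng = np.random.default_rng(0)
d_out, d_in, k, r = 64, 48, 8, 4           # illustrative sizes only
W = rng.normal(size=(d_out, d_in))         # stand-in for a pre-trained weight

U_p, s_p, V_p, W_m = spectral_split(W, k)  # frozen quantities
ad = init_adapters(d_out, d_in, k, r, rng) # the only trainable quantities
W_adapted = adapted_weight(U_p, s_p, V_p, W_m, ad)
n_trainable = sum(a.size for a in ad.values())
```

Under these assumptions the adapted weight reproduces `W` before any training step, and the number of trainable values is `r * (d_out + k) + r * (d_in + k)`, which depends on both the rank `r` and the number of top components `k`.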
