Dialect Identification Using Resource-Efficient Fine-Tuning Approaches
Zirui Lin, Haris Gulzar, Monnika Roslianna Busto, Akiko Masaki, Takeharu Eda, Kazuhiro Nakadai
TL;DR
This work tackles the resource-intensive fine-tuning of large speech models for dialect identification (DI) by applying Memory-Efficient Fine-Tuning (MEFT) techniques. It applies three MEFT methods, Ladder Side-Tuning (LST), Universal Parallel Tuning (UniPT), and SHERL, on top of the Whisper encoder. On the KeSpeech Mandarin subdialect dataset, this reduces GPU memory usage by up to 73.25% and speeds up training by up to 2.1x, with accuracy close to that of full fine-tuning and PEFT baselines. The study shows that MEFT can stay close to vanilla fine-tuning performance while significantly easing computational demands, which addresses a practical constraint for DI on low-resource GPUs. The results suggest MEFT's potential for broader speech tasks, and the paper outlines the code and mechanisms for extending resource-efficient fine-tuning in speech processing.
Abstract
Dialect Identification (DI) is the task of recognizing different dialects of the same language from a speech signal. DI can help improve downstream speech-related tasks even when speakers have a strong dialect. However, fine-tuning a speech model for tasks like DI is expensive in terms of both computation cost and memory requirements. Recent studies have explored fine-tuning pre-trained speech models for tasks like DI using Parameter-Efficient Fine-Tuning (PEFT) methods, which offer parameter efficiency but only limited improvement in memory efficiency and training speed. To address these challenges, we explore Memory-Efficient Fine-Tuning (MEFT) methods, originally proposed for language processing, and apply them to a general-purpose pre-trained speech model. We then comprehensively analyze the GPU memory usage and fine-tuning speed of various MEFT methods. As a case study, we fine-tune the Whisper model to identify six Mandarin subdialects from the KeSpeech dataset, reducing GPU memory usage by up to 73.25% and speeding up training by a factor of up to 2.1, while maintaining accuracy comparable to vanilla fine-tuning and PEFT methods.
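To make the headline figures concrete, the following minimal sketch converts a percentage memory reduction and a speedup factor into absolute terms. The baseline values (40 GB of GPU memory and 10 hours of training) are hypothetical and chosen only for illustration; they are not reported in the paper. Only the 73.25% and 2.1x figures come from the abstract.

```python
def remaining_fraction(reduction_percent: float) -> float:
    """Fraction of the baseline still used after a percentage reduction."""
    return 1.0 - reduction_percent / 100.0


def time_fraction(speedup: float) -> float:
    """Fraction of the baseline training time needed at a given speedup."""
    return 1.0 / speedup


# Figures quoted in the abstract (best case, "up to").
MEMORY_REDUCTION_PERCENT = 73.25
TRAINING_SPEEDUP = 2.1

# Hypothetical baseline for vanilla fine-tuning (assumed, not from the paper).
baseline_memory_gb = 40.0
baseline_hours = 10.0

memory_after_gb = baseline_memory_gb * remaining_fraction(MEMORY_REDUCTION_PERCENT)
hours_after = baseline_hours * time_fraction(TRAINING_SPEEDUP)

print(f"Memory: {baseline_memory_gb:.1f} GB -> {memory_after_gb:.2f} GB")
print(f"Time:   {baseline_hours:.1f} h -> {hours_after:.2f} h")
```

Under these assumed baselines, a 73.25% reduction leaves roughly a quarter of the original memory footprint, and a 2.1x speedup cuts training time to a little under half.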
