Serial-Parallel Dual-Path Architecture for Speaking Style Recognition
Guojian Li, Qijie Shao, Zhixian Zhao, Shuiyuan Wang, Zhonghua Fu, Lei Xie
TL;DR
This work addresses a limitation of existing Speaking Style Recognition (SSR) approaches, which rely mainly on linguistic information, by introducing a serial-parallel dual-path architecture that fuses acoustic and linguistic modalities. The serial path follows an ASR+STYLE paradigm with a frozen Whisper-Medium encoder and a LoRA-finetuned LLM, while the parallel path uses the Acoustic-Linguistic Similarity Module (ALSM) to enable synchronized cross-modal interaction through attention-guided alignment and multi-space cross-modal similarity. On a test set covering eight speaking styles, the proposed method uses 88.4% fewer parameters than the OSUM baseline and improves SSR accuracy by 30.3%, indicating that bimodal fusion is both effective and efficient for SSR. The approach highlights the value of combining sequential and synchronized cross-modal processing and points to future extensions to related speech understanding tasks such as emotion recognition and sound event detection.
Abstract
Speaking Style Recognition (SSR) identifies a speaker's speaking style from speech. Existing approaches rely primarily on linguistic information and make limited use of acoustic information, which restricts recognition accuracy. Fusing the acoustic and linguistic modalities offers significant potential to improve recognition performance. In this paper, we propose a novel serial-parallel dual-path architecture for SSR that leverages acoustic-linguistic bimodal information. The serial path follows the ASR+STYLE paradigm, reflecting a sequential temporal dependency, while the parallel path integrates our Acoustic-Linguistic Similarity Module (ALSM) to facilitate cross-modal interaction with temporal simultaneity. Compared with the existing SSR baseline, the OSUM model, our approach reduces parameter size by 88.4% and achieves a 30.3% improvement in SSR accuracy across eight styles on the test set.
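
The abstract does not specify the internals of ALSM, so the following is only a minimal, generic sketch of what attention-guided alignment followed by similarity in several projection spaces could look like. It is not the authors' implementation. All function names, the use of scaled dot-product attention, the cosine similarity measure, the mean pooling, and the assumption that linguistic and acoustic features share one dimension are illustrative choices.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)

def align(linguistic, acoustic):
    # Hypothetical alignment step: each linguistic token attends over all
    # acoustic frames and receives an attention-weighted acoustic vector.
    dim = len(acoustic[0])
    scale = math.sqrt(dim)
    aligned = []
    for token in linguistic:
        weights = softmax([dot(token, frame) / scale for frame in acoustic])
        aligned.append([
            sum(w * frame[d] for w, frame in zip(weights, acoustic))
            for d in range(dim)
        ])
    return aligned

def project(vec, matrix):
    # Linear projection: one output value per matrix row.
    return [dot(row, vec) for row in matrix]

def multi_space_similarity(linguistic, acoustic, projections):
    # Assumed design: compare each token with its aligned acoustic vector
    # in every projection space, then average over tokens per space.
    aligned = align(linguistic, acoustic)
    features = []
    for matrix in projections:
        sims = [
            cosine(project(tok, matrix), project(aco, matrix))
            for tok, aco in zip(linguistic, aligned)
        ]
        features.append(sum(sims) / len(sims))
    return features

# Toy inputs: 2 linguistic tokens, 3 acoustic frames, 2 projection spaces.
tokens = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
spaces = [
    [[1.0, 0.0], [0.0, 1.0]],   # identity space
    [[1.0, 1.0], [1.0, -1.0]],  # sum/difference space
]
similarity_features = multi_space_similarity(tokens, frames, spaces)
```

In this sketch, `similarity_features` holds one value per projection space, each between -1 and 1. In a real system the projection matrices would be learned, and the resulting features would feed a style classifier alongside the serial path's output.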
