LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation
Jun Chen, Shichao Hu, Jiuxin Lin, Wenjie Li, Zihan Zhang, Xingchen Li, JinJiang Liu, Longshuai Xiao, Chao Weng, Lei Xie, Zhiyong Wu
TL;DR
The paper tackles real-time in-car multi-zone speech separation under tight compute budgets. It proposes LSZone, a lightweight architecture built from two modules: SpaIEC (spatial information extraction-compression), which combines Mel spectrograms with Interaural Phase Difference (IPD) to reduce feature dimensionality, and a very lightweight Conv-GRU crossband-narrowband processing (CNP) module for spatial, frequency, and temporal modeling. With only 0.56G MACs and a real-time factor (RTF) of 0.37, LSZone is reported to outperform baselines such as Zoneformer, DualSep, and SpatialNet in CER and FIR in both single-speaker and multi-speaker scenarios, including when evaluated with different ASR backends. The work points to practical deployment of high-quality, real-time speech separation in vehicles, improving human-vehicle interaction while limiting audio leakage between zones.
Abstract
In-car multi-zone speech separation, which captures voices from different speech zones, plays a crucial role in human-vehicle interaction. Although SpatialNet has achieved notable results, its high computational cost hinders real-time use in vehicles. To this end, this paper proposes LSZone, a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation. We design a spatial information extraction-compression (SpaIEC) module that combines the Mel spectrogram and Interaural Phase Difference (IPD) to reduce the computational burden while maintaining performance. Additionally, to model spatial information efficiently, we introduce an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module. Experimental results demonstrate that LSZone, with a complexity of 0.56G MACs and a real-time factor (RTF) of 0.37, delivers strong performance in complex noise and multi-speaker scenarios.
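The abstract does not spell out how SpaIEC combines the two features, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that IPD is taken between each microphone and a reference microphone, that it is encoded as cosine and sine, and that a Mel filterbank compresses the frequency axis of both the power spectrum and the IPD features. The function names (`stft`, `mel_filterbank`, `spatial_features`) and all parameter values (16 kHz sampling, 512-point FFT, 40 Mel bands, 4 microphones) are hypothetical choices made for illustration.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Naive Hann-windowed STFT; returns a (frames, n_fft // 2 + 1) complex array."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular Mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Band edges are equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)

    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def spatial_features(mics, sr=16000, n_fft=512, hop=256, n_mels=40, ref=0):
    """Assumed feature layout: log-Mel of the reference mic plus Mel-compressed
    cos/sin IPD for every other mic.

    mics: array of shape (n_mics, n_samples).
    Returns an array of shape (frames, n_mels * (1 + 2 * (n_mics - 1))).
    """
    specs = np.stack([stft(ch, n_fft, hop) for ch in mics])  # (mics, T, F)
    fb = mel_filterbank(n_mels, n_fft, sr)                   # (n_mels, F)
    # Row-normalised copy so each Mel band is a weighted average of the IPD terms.
    fb_avg = fb / (fb.sum(axis=1, keepdims=True) + 1e-8)

    feats = [np.log(np.abs(specs[ref]) ** 2 @ fb.T + 1e-8)]  # log-Mel, (T, n_mels)
    for m in range(len(mics)):
        if m == ref:
            continue
        # Wrapped phase difference between mic m and the reference mic.
        ipd = np.angle(specs[m] * np.conj(specs[ref]))
        feats.append(np.cos(ipd) @ fb_avg.T)
        feats.append(np.sin(ipd) @ fb_avg.T)
    return np.concatenate(feats, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mics = rng.standard_normal((4, 16000))  # 4 mics, 1 s of noise at 16 kHz
    print(spatial_features(mics).shape)
```

Under these assumptions each frame carries 40 × (1 + 2 × 3) = 280 values, compared with 257 × 7 = 1799 if the same features were kept at full STFT resolution. This illustrates the kind of dimensionality reduction the abstract attributes to SpaIEC; the actual compression ratio in LSZone may differ.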
