LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation

Jun Chen, Shichao Hu, Jiuxin Lin, Wenjie Li, Zihan Zhang, Xingchen Li, JinJiang Liu, Longshuai Xiao, Chao Weng, Lei Xie, Zhiyong Wu

TL;DR

The paper tackles real-time in-car multi-zone speech separation under limited compute resources. It proposes LSZone, a lightweight architecture with two components: SpaIEC, which fuses Mel spectrograms with Interaural Phase Difference (IPD) to reduce feature dimensionality, and an extremely lightweight Conv-GRU CNP module for crossband-narrowband spatial–frequency–temporal modeling. With only 0.56G MACs and a real-time factor of 0.37, LSZone is reported to outperform baselines such as Zoneformer, DualSep, and SpatialNet in CER and FIR across single- and multi-speaker scenarios, including when evaluated with different ASR backends. The work points to practical deployment of high-quality, real-time speech separation in vehicles, improving human-vehicle interaction while minimizing audio leakage between zones.
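To make the real-time factor figure concrete: RTF is conventionally the ratio of processing time to audio duration, so a value below 1 means the system keeps up with the incoming audio. The helper below is a minimal sketch of that arithmetic; the function name and the example durations are illustrative assumptions, not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: time spent processing divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical measurement: 10 s of audio processed in 3.7 s of wall-clock time.
rtf = real_time_factor(processing_seconds=3.7, audio_seconds=10.0)
keeps_up = rtf < 1.0  # below 1.0 means faster than real time
```

RTF depends on the hardware used for the measurement, so a reported value is only comparable across systems measured on the same device.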

Abstract

In-car multi-zone speech separation, which captures voices from different speech zones, plays a crucial role in human-vehicle interaction. Although the earlier SpatialNet has achieved notable results, its high computational cost still hinders real-time applications in vehicles. To this end, this paper proposes LSZone, a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation. We design a spatial information extraction-compression (SpaIEC) module that combines the Mel spectrogram and Interaural Phase Difference (IPD) to reduce the computational burden while maintaining performance. Additionally, to efficiently model spatial information, we introduce an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module. Experimental results demonstrate that LSZone, with a complexity of 0.56G MACs and a real-time factor (RTF) of 0.37, delivers impressive performance in complex noise and multi-speaker scenarios.
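The abstract does not spell out how the IPD is computed. A common definition is the phase of one microphone's STFT relative to a reference microphone, often encoded as cosine and sine so the feature stays continuous across the ±π wrap. The NumPy sketch below illustrates that general idea only; the function names, the naive STFT, the frame sizes, and the cos/sin encoding are all assumptions for illustration and are not claimed to match the SpaIEC module.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Naive Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)


def ipd_features(ref, other, n_fft=512, hop=256):
    """Phase of `other` relative to `ref` per time-frequency bin, as (cos, sin)."""
    x_ref = stft(ref, n_fft, hop)
    x_oth = stft(other, n_fft, hop)
    # angle(X_other * conj(X_ref)) = phase(other) - phase(ref), wrapped to (-pi, pi]
    ipd = np.angle(x_oth * np.conj(x_ref))
    return np.cos(ipd), np.sin(ipd)


# Hypothetical two-microphone signal: a 1 kHz tone that reaches the second
# microphone 4 samples later (16 kHz sampling rate).
sr = 16000
t = np.arange(sr) / sr
ref = np.sin(2 * np.pi * 1000 * t)
other = np.roll(ref, 4)  # the tone is periodic over the buffer, so roll acts as a pure delay

cos_ipd, sin_ipd = ipd_features(ref, other)
# 1 kHz falls on bin 32 (1000 * 512 / 16000); a 4-sample delay is a quarter period,
# so the IPD at that bin is close to -pi/2.
```

Only bins that carry signal energy give a meaningful IPD; in near-silent bins the phase is dominated by numerical noise.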

Paper Structure

This paper contains 13 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The overall diagram of LSZone. It mainly comprises the SpaIEC Module and a backbone structure consisting of Conv1D, Conv-GRU CNP Modules, and a linear layer.
  • Figure 2: The details of the SpaIEC Module, which primarily consists of a Conv1D Squeezer and a Gate Fusion mechanism.
  • Figure 3: The specifics of the Conv-GRU CNP Module. The "F-Conv1d" indicates the Frequency Conv1d while the "G-Conv1d" denotes the Group Conv1d.
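The Figure 3 caption mentions a Group Conv1d. As general background on why grouped convolutions suit a lightweight design: a 1-D convolution with `groups = g` connects each output channel to only `in_channels / g` input channels, so both its weight count and its multiply-accumulates per output time step shrink by a factor of `g`. The sketch below works through that arithmetic; the channel counts, kernel size, and group count are hypothetical and are not the paper's configuration.

```python
def conv1d_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Weights in a 1-D convolution (bias excluded).

    The same number is also the multiply-accumulate count per output time step.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    return out_channels * (in_channels // groups) * kernel_size


# Hypothetical layer sizes, chosen only to show the scaling.
dense = conv1d_weight_count(64, 64, kernel_size=3)              # 64 * 64 * 3 = 12288
grouped = conv1d_weight_count(64, 64, kernel_size=3, groups=8)  # 64 * 8 * 3 = 1536
reduction = dense // grouped                                    # 8, equal to the group count
```

The trade-off is that channels in different groups no longer mix within the layer, which is why grouped convolutions are usually paired with another layer that does mix them.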