Input-Adaptive Spectral Feature Compression by Sequence Modeling for Source Separation

Kohei Saijo, Yoshiaki Bando

TL;DR

Two variants of Spectral Feature Compression (SFC) are investigated, one based on cross-attention and the other on Mamba, with inductive biases inspired by the band-split (BS) module introduced to make them suitable for compressing frequency information.

Abstract

Time-frequency domain dual-path models have demonstrated strong performance and are widely used in source separation. Because their computational cost grows with the number of frequency bins, these models often use the band-split (BS) module in high-sampling-rate tasks such as music source separation (MSS) and cinematic audio source separation (CASS). The BS encoder compresses frequency information by encoding features for each predefined subband. It achieves effective compression by introducing an inductive bias that places greater emphasis on low-frequency parts. Despite its success, the BS module has two inherent limitations: (i) it is not input-adaptive, preventing the use of input-dependent information, and (ii) the parameter count is large, since each subband requires a dedicated module. To address these issues, we propose Spectral Feature Compression (SFC). SFC compresses the input using a single sequence modeling module, making it both input-adaptive and parameter-efficient. We investigate two variants of SFC, one based on cross-attention and the other on Mamba, and introduce inductive biases inspired by the BS module to make them suitable for frequency information compression. Experiments on MSS and CASS tasks demonstrate that the SFC module consistently outperforms the BS module across different separator sizes and compression ratios. We also provide an analysis showing that SFC adaptively captures frequency patterns from the input.
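The idea of compressing $F$ frequency bins into $K$ features with a single module can be illustrated with a minimal sketch of the cross-attention variant. This is not the authors' implementation: the sizes `F`, `K`, and `D`, the function name `cross_attention_compress`, the single attention head, and the omission of key/value projections and positional bias are all simplifying assumptions, and the queries are random constants rather than learned parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_compress(features, queries):
    """Compress F feature vectors into K vectors with single-head cross-attention.

    features: F vectors of dimension D (one per frequency bin, for one time frame).
    queries:  K vectors of dimension D (learnable in a real model, fixed here).
    Keys and values are the features themselves; projection layers are omitted.
    """
    dim = len(queries[0])
    compressed = []
    for q in queries:
        scores = [
            sum(qi * xi for qi, xi in zip(q, x)) / math.sqrt(dim)
            for x in features
        ]
        weights = softmax(scores)  # depends on the input, hence input-adaptive
        compressed.append(
            [sum(w * x[d] for w, x in zip(weights, features)) for d in range(dim)]
        )
    return compressed


rng = random.Random(0)
F, K, D = 16, 4, 8  # hypothetical sizes: frequency bins, subband features, channels
features = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(F)]
queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]

compressed = cross_attention_compress(features, queries)
print(len(compressed), len(compressed[0]))  # K vectors of dimension D
```

The same `K` queries are shared by every input, so the module's size does not grow with the number of subbands, while the attention weights change with the spectrum being compressed. A band-split encoder, by contrast, applies a fixed, dedicated sub-encoder to each predefined subband.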

Paper Structure

This paper contains 40 sections, 15 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Overview of (a) TF-domain dual-path separation models, (b) band-split (BS) encoder, and (c) proposed spectral feature compression (SFC) encoder. The BS module divides an input spectrogram with $F$ frequency bins into $K$ subband spectrograms based on a predefined band configuration (e.g., mel) and processes them with $K$ different sub-encoders to compress the inputs into $K$ subband features. The proposed SFC also compresses the input into $K$ subband features, but does so with a single sequence modeling module using $K$ queries, making it both input-adaptive and parameter-efficient.
  • Figure 2: Detailed architecture of the proposed SFC using cross-attention (SFC-CA). The SFC-CA encoder compresses input spectral features of length $F$ into compressed features of length $K$ through cross-attention with randomly initialized learnable queries of length $K$. The decoder uncompresses the features in a similar manner, but with queries of length $F$. To introduce a psychoacoustically motivated inductive bias, similarly to the BS module, we design a positional bias (right; Sec. \ref{sssec:crossattn_with_inductive_bias}) based on the band definition $G$, which assigns higher values to frequency bins $f$ included in the $k$-th band. The vertical boxes represent the data at a specific time frame $t$, whereas the shapes of the variables refer to those of the entire tensor across all $T$ time frames.
  • Figure 3: Detailed architecture of the proposed SFC using recurrent models (SFC-Mamba). The SFC-Mamba encoder first interleaves the input features and the queries. Mamba is then applied, and the outputs at the query positions (blue boxes) are used as the compressed features fed to the separator. We apply Mamba bidirectionally: one scans the sequence in the normal order (forward Mamba) and the other scans it in the reversed order (backward Mamba). The decoder also applies bidirectional Mamba to the interleaved sequence, but the decoder queries are derived from the outputs at the feature positions (orange boxes) of the encoder's Mamba. The interleaving algorithms (right) are designed to introduce a psychoacoustically motivated inductive bias inspired by the band-split module (Sec. \ref{sssec:interleaving_strategy}). The vertical boxes represent the data at a specific time frame $t$, whereas the shapes of the variables refer to those of the entire tensor across all $T$ time frames.
  • Figure 4: Scatter plot of uSDR [dB] values. The horizontal and vertical axes denote the performance of the small model with BS (A1) and with SFC-CA (A3), respectively.
  • Figure 5: The SFC-CA encoder's default positional bias $\bm{P}^{\mathcal{E}}$ defined in Eq. (\ref{eq:position_bias}) (top) and the learned positional bias of the E10 model in Table \ref{table:ablation_inductive_bias} for each head (bottom four). For better visualization, we show $\mathrm{Softmax}(\bm{P}^{\mathcal{E}})$ instead of the raw $\bm{P}^{\mathcal{E}}$.
  • ...and 1 more figure
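The band-based positional bias described in the captions of Figures 2 and 5 can be sketched as follows. The exact form of Eq. (\ref{eq:position_bias}) is not reproduced in this summary, so the code below is only an assumed stand-in: the bias is 0 for bins inside the $k$-th band and a constant negative value outside it, and the band edges in `bands`, the `penalty` value, and the function name `band_positional_bias` are hypothetical.

```python
import math


def band_positional_bias(bands, num_bins, penalty=4.0):
    """Build a K x F bias that is 0 inside band k and -penalty outside it.

    bands: list of (start, end) bin ranges, end exclusive (hypothetical band definition).
    """
    return [
        [0.0 if start <= f < end else -penalty for f in range(num_bins)]
        for start, end in bands
    ]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical band definition: narrow bands at low frequencies, wide at high ones.
bands = [(0, 2), (2, 4), (4, 8), (8, 16)]
bias = band_positional_bias(bands, num_bins=16)

# Attention each query would receive from the bias alone (all content scores equal).
prior = [softmax(row) for row in bias]

for k, row in enumerate(prior):
    in_band = sum(row[f] for f in range(*bands[k]))
    print(f"query {k}: prior attention mass inside its band = {in_band:.2f}")
```

In a cross-attention encoder, a bias of this kind would be added to the query-key scores before the softmax, so each query starts out focused on its own band but can still attend elsewhere when the input calls for it. Taking the row-wise softmax, as in Figure 5, makes that prior easy to visualize.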