TACNET: Temporal Audio Source Counting Network

Amirreza Ahmadnejad, Ahmad Mahmmodian Darviishani, Mohmmad Mehrdad Asadi, Sajjad Saffariyeh, Pedram Yousef, Emad Fatemizadeh

TL;DR

TaCNet introduces a learnable raw-audio front-end for temporal audio source counting, combining filtering, downsampling, and PCEN-like compression with a classifier to estimate the number of active speakers in short audio segments. Evaluated on LibriCount, it achieves state-of-the-art performance across 11 classes and demonstrates cross-language transfer to Chinese and Persian without accuracy loss, highlighting its generalizability. The model supports online, low-latency counting through a small 25 ms window and shows potential as a preprocessing module before audio separation tasks. Overall, TaCNet advances audio source counting by learning end-to-end features from raw audio, outperforming handcrafted-feature baselines and enabling broader applicability.
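The filter, downsample, compress pipeline summarized above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the Gabor filters here are fixed rather than learned, and the function names, filter count, kernel length, pooling hop, and PCEN constants are all hypothetical choices. The frame length of 3200 samples follows the input shape given in the Figure 2 caption.

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping frames, dropping any remainder."""
    n_frames = len(x) // frame_len
    return x[: n_frames * frame_len].reshape(n_frames, frame_len)

def gabor_kernels(center_freqs, bandwidth, kernel_len):
    """Complex Gabor kernels: a Gaussian envelope times a complex sinusoid.
    center_freqs are in radians/sample; bandwidth is the Gaussian std in samples."""
    t = np.arange(kernel_len) - (kernel_len - 1) / 2
    envelope = np.exp(-0.5 * (t / bandwidth) ** 2)
    return envelope[None, :] * np.exp(1j * center_freqs[:, None] * t[None, :])

def filter_and_pool(frame, kernels, hop):
    """Filter one frame with each kernel, take the squared magnitude,
    then downsample by averaging over blocks of `hop` samples."""
    energies = np.stack(
        [np.abs(np.convolve(frame, k, mode="same")) ** 2 for k in kernels]
    )
    n_steps = energies.shape[1] // hop
    trimmed = energies[:, : n_steps * hop]
    return trimmed.reshape(len(kernels), n_steps, hop).mean(axis=2)

def pcen(E, s=0.04, alpha=0.96, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization over the time axis (axis 1).
    M is a first-order smoothed version of E; constants are assumed values."""
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1 - s) * M[:, t - 1] + s * E[:, t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

# Synthetic noise stands in for audio; 4 frames of 3200 samples each.
rng = np.random.default_rng(0)
audio = rng.standard_normal(4 * 3200)
frames = frame_signal(audio, 3200)

# 8 hypothetical filters spread across the normalized frequency range.
kernels = gabor_kernels(np.linspace(0.1, 3.0, 8), bandwidth=20.0, kernel_len=101)
features = np.stack([pcen(filter_and_pool(f, kernels, hop=160)) for f in frames])
print(features.shape)  # (4, 8, 20): frames x filters x pooled time steps
```

In a learnable front-end, the center frequencies, bandwidths, and compression parameters would be trained jointly with the classifier instead of being fixed as they are in this sketch.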

Abstract

In this paper, we introduce the Temporal Audio Source Counting Network (TaCNet), an innovative architecture that addresses limitations in audio source counting tasks. TaCNet operates directly on raw audio inputs, eliminating complex preprocessing steps and simplifying the workflow. Notably, it excels in real-time speaker counting, even with truncated input windows. Our extensive evaluation, conducted using the LibriCount dataset, underscores TaCNet's exceptional performance, positioning it as a state-of-the-art solution for audio source counting tasks. With an average accuracy of 74.18% over 11 classes, TaCNet demonstrates its effectiveness across diverse scenarios, including applications involving Chinese and Persian languages. This cross-lingual adaptability highlights its versatility and potential impact.

Paper Structure (14 sections, 9 equations, 6 figures, 1 table)

This paper contains 14 sections, 9 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of the TaCNet model. The audio file is first segmented into a fixed number of partitions. Each partition then passes through a feature-extraction block with three stages: filtering, downsampling, and compression. The extracted features are passed to the classification block, which estimates the number of speakers present in the input audio.
  • Figure 2: General outline of the model architecture. The audio signal, sampled over a 25-millisecond window and of shape 1×3200, is fed into the 1D convolution layer. Gabor filters are then applied to the signal, and the result is passed to the classifier, as explained in the architecture.
  • Figure 3: An example from the LibriCount dataset featuring four speakers. The first segment shows the first speaker, who is active for the entire duration of the recording. The following segments show the second, third, and fourth speakers, respectively, each with periods of vocal inactivity. The final segment shows the composite (overlapping) mixture of these speaker activities, in which different sets of speakers are active at different times in the audio file.
  • Figure 4: Mean Absolute Error (MAE) across window sizes. The MAE first decreases and then increases, with the minimum error at a window size of 25 milliseconds.
  • Figure 5: Confusion matrix of the speaker counting system on the test set. The diagonal elements give the number of test segments for which the estimated speaker count matched the true number of speakers; the off-diagonal elements give the number of segments for which the system estimated the count incorrectly.
  • ...and 1 more figure
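The quantities behind Figures 4 and 5 can be illustrated with a short NumPy sketch that builds a confusion matrix from true and estimated speaker counts and derives the MAE and a per-class mean accuracy from it. The labels below are made-up toy values, not results from the paper, and the function names are hypothetical. The sketch also assumes that the 11 classes correspond to counts 0 through 10 and that "average accuracy" means the mean of per-class accuracies; the source does not spell out either point.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of segments with true count i and estimated count j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def mae_from_confusion(cm):
    """Mean absolute error, weighting each cell by |true count - estimated count|."""
    idx = np.arange(cm.shape[0])
    abs_err = np.abs(idx[:, None] - idx[None, :])
    return (cm * abs_err).sum() / cm.sum()

def mean_class_accuracy(cm):
    """Average of per-class accuracies over the classes that occur in y_true."""
    row_totals = cm.sum(axis=1)
    present = row_totals > 0
    return (np.diag(cm)[present] / row_totals[present]).mean()

# Toy labels for illustration only (assumed class range: 0 to 10 speakers).
y_true = np.array([0, 1, 2, 2, 3, 5, 10, 10])
y_pred = np.array([0, 1, 2, 3, 3, 4, 10, 8])

cm = confusion_matrix(y_true, y_pred, n_classes=11)
print(np.trace(cm))             # correctly counted segments: 5
print(mae_from_confusion(cm))   # 0.5
print(mean_class_accuracy(cm))
```

Because the classes are ordered counts, the MAE complements accuracy: it distinguishes an estimate that is off by one speaker from one that is off by several, which a plain accuracy figure does not.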