TACNET: Temporal Audio Source Counting Network
Amirreza Ahmadnejad, Ahmad Mahmmodian Darviishani, Mohmmad Mehrdad Asadi, Sajjad Saffariyeh, Pedram Yousef, Emad Fatemizadeh
TL;DR
TaCNet introduces a learnable raw-audio front-end for temporal audio source counting, combining filtering, downsampling, and PCEN-like compression with a classifier to estimate the number of active speakers in short audio segments. Evaluated on LibriCount, it achieves state-of-the-art performance across 11 classes and demonstrates cross-language transfer to Chinese and Persian without accuracy loss, highlighting its generalizability. The model supports online, low-latency counting through a small 25 ms window and shows potential as a preprocessing module before audio separation tasks. Overall, TaCNet advances audio source counting by learning end-to-end features from raw audio, outperforming handcrafted-feature baselines and enabling broader applicability.
Abstract
In this paper, we introduce the Temporal Audio Source Counting Network (TaCNet), an architecture that addresses limitations in audio source counting tasks. TaCNet operates directly on raw audio inputs, eliminating complex preprocessing steps and simplifying the workflow. Notably, it performs well in real-time speaker counting, even with truncated input windows. Our extensive evaluation on the LibriCount dataset shows that TaCNet achieves state-of-the-art performance for audio source counting, with an average accuracy of 74.18% over 11 classes. TaCNet also remains effective across diverse scenarios, including applications to Chinese and Persian speech, and this cross-lingual adaptability highlights its versatility and potential impact.
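To make the PCEN-like compression mentioned in the TL;DR more concrete, the following is a minimal NumPy sketch of standard per-channel energy normalization applied to 25 ms frames of raw audio. It is not TaCNet's implementation: the function names, the 16 kHz sample rate, the single-channel energy, and the fixed parameter values (`s`, `alpha`, `delta`, `r`) are illustrative assumptions, whereas a learnable front-end would train such parameters together with the classifier.

```python
import numpy as np


def frame_energy(wave, sample_rate=16000, window_ms=25.0):
    """Split raw audio into non-overlapping windows and return mean energy.

    The 16 kHz rate and the single energy channel are assumptions made
    for illustration; a learned front-end would use a bank of filters.
    """
    win = int(sample_rate * window_ms / 1000)  # samples per window
    n_frames = len(wave) // win
    frames = np.asarray(wave[: n_frames * win], dtype=np.float64)
    frames = frames.reshape(n_frames, win)
    return (frames ** 2).mean(axis=1, keepdims=True)  # shape (frames, 1)


def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization with fixed (not learned) parameters.

    M[t] = (1 - s) * M[t-1] + s * E[t]            (smoothed energy)
    PCEN[t] = (E[t] / (eps + M[t])**alpha + delta)**r - delta**r
    """
    energy = np.asarray(energy, dtype=np.float64)
    smoothed = np.empty_like(energy)
    smoothed[0] = energy[0]
    for t in range(1, energy.shape[0]):
        smoothed[t] = (1.0 - s) * smoothed[t - 1] + s * energy[t]
    return (energy / (eps + smoothed) ** alpha + delta) ** r - delta ** r


# Usage: one second of synthetic noise stands in for a raw audio segment.
rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)
features = pcen(frame_energy(wave))
```

Under these assumptions, one second of audio yields 40 frames of 400 samples each, and the compressed features are non-negative by construction. The recursive smoother only looks at past frames, which is what makes this kind of compression compatible with online, low-latency processing.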
