Vision Transformer Segmentation for Visual Bird Sound Denoising
Sahil Kumar, Jialu Li, Youshan Zhang
TL;DR
This work reframes bird sound denoising as an image segmentation problem by converting audio to spectrogram images and applying a vision-transformer–based encoder–decoder (ViTVS) to predict noise masks. The model employs STFT/ISTFT for audio-image construction and reconstruction, a 12-layer self-attention architecture with PatchEmbedding, and negative log-likelihood loss for robust segmentation. On the BirdSoundsDenoising dataset, ViTVS achieves state-of-the-art performance across segmentation metrics ($F1$, $IoU$, $Dice$) and the SDR-based denoising measure, with a 12-block configuration identified as optimal in ablation studies. The approach demonstrates strong generalization to natural environmental noises and offers a benchmark for real-world bird sound denoising, with public code available for replication.
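The evaluation measures named above can be illustrated with a minimal sketch. It assumes binary noise masks and the standard textbook definitions of $F1$, $IoU$, $Dice$, and SDR. The function names are hypothetical and not taken from the ViTVS code base, and the paper's exact averaging and SDR conventions may differ.

```python
import math


def confusion_counts(pred, target):
    """Count TP, FP, FN over two equally sized binary masks (nested lists of 0/1)."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def segmentation_scores(pred, target):
    """Return F1, IoU, and Dice for a predicted mask against a reference mask."""
    tp, fp, fn = confusion_counts(pred, target)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"f1": f1, "iou": iou, "dice": dice}


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB (assumed definition: signal power over error power)."""
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_power == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_power / error_power)


# Toy 2x3 masks: 2 true positives, 1 false positive, 1 false negative.
pred_mask = [[1, 1, 0], [0, 1, 0]]
true_mask = [[1, 0, 0], [0, 1, 1]]
scores = segmentation_scores(pred_mask, true_mask)

# Toy waveforms: the estimate deviates from the reference in two samples.
clean = [1.0, -1.0, 1.0, -1.0]
denoised = [0.9, -1.1, 1.0, -1.0]
sdr = sdr_db(clean, denoised)
```

For a single binary mask, $F1$ and $Dice$ reduce to the same quantity under these definitions, so any difference between the two in reported results would come from how scores are averaged across classes or samples.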
Abstract
Audio denoising, especially for bird sounds, remains challenging because residual noise persists after processing. Traditional and deep learning methods often struggle with artificial or low-frequency noise. In this work, we propose ViTVS, a novel approach built on the vision transformer (ViT) architecture. ViTVS uses segmentation techniques to separate clean audio from complex signal mixtures. Our key contribution is the development of ViTVS, which introduces comprehensive, long-range, and multi-scale representations that directly address the limitations of conventional approaches. Extensive experiments demonstrate that ViTVS outperforms state-of-the-art methods, positioning it as a benchmark solution for real-world bird sound denoising applications. Source code is available at: https://github.com/aiai-4/ViVTS.
