Vision Transformer Segmentation for Visual Bird Sound Denoising

Sahil Kumar, Jialu Li, Youshan Zhang

TL;DR

This work reframes bird sound denoising as an image segmentation problem by converting audio to spectrogram images and applying a vision-transformer–based encoder–decoder (ViTVS) to predict noise masks. The model employs STFT/ISTFT for audio-image construction and reconstruction, a 12-layer self-attention architecture with PatchEmbedding, and negative log-likelihood loss for robust segmentation. On the BirdSoundsDenoising dataset, ViTVS achieves state-of-the-art performance across segmentation metrics ($F1$, $IoU$, $Dice$) and the SDR-based denoising measure, with a 12-block configuration identified as optimal in ablation studies. The approach demonstrates strong generalization to natural environmental noises and offers a benchmark for real-world bird sound denoising, with public code available for replication.
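
The STFT, mask, ISTFT round trip described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses SciPy's `stft`/`istft`, a hypothetical `mask_fn` callback that stands in for the ViTVS segmentation model, a mask convention of 1 = keep and 0 = discard, and a common SDR definition (signal energy over residual energy, in dB) that may differ in detail from the one used in the paper. The sample rate, window length, and threshold are arbitrary illustrative values.

```python
import numpy as np
from scipy.signal import stft, istft


def denoise_with_mask(noisy, mask_fn, fs=16000, nperseg=512):
    """STFT -> mask -> ISTFT.

    mask_fn is a hypothetical stand-in for the segmentation model: it
    receives the magnitude spectrogram (the "audio image") and returns an
    array of the same shape, 1 where signal is kept and 0 where it is dropped.
    """
    _, _, spec = stft(noisy, fs=fs, nperseg=nperseg)
    mask = mask_fn(np.abs(spec))
    _, cleaned = istft(spec * mask, fs=fs, nperseg=nperseg)
    return cleaned[: len(noisy)]  # istft may return a few padded samples


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB (assumed definition)."""
    residual = reference - estimate
    return 10 * np.log10(np.sum(reference**2) / (np.sum(residual**2) + 1e-12))


# Toy example: a 3 kHz tone buried in white noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 3000 * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)

# A simple magnitude threshold plays the role of the predicted mask.
cleaned = denoise_with_mask(noisy, lambda mag: mag > 0.2 * mag.max())

print(f"SDR before: {sdr_db(clean, noisy):.1f} dB")
print(f"SDR after:  {sdr_db(clean, cleaned):.1f} dB")
```

In the real system the threshold lambda would be replaced by the trained segmentation network; everything else in the round trip stays the same.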

Abstract

Audio denoising, especially in the context of bird sounds, remains a challenging task due to persistent residual noise. Traditional and deep learning methods often struggle with artificial or low-frequency noise. In this work, we propose ViTVS, a novel approach that leverages the power of the vision transformer (ViT) architecture. ViTVS adeptly combines segmentation techniques to disentangle clean audio from complex signal mixtures. Our key contributions encompass the development of ViTVS, introducing comprehensive, long-range, and multi-scale representations. These contributions directly tackle the limitations inherent in conventional approaches. Extensive experiments demonstrate that ViTVS outperforms state-of-the-art methods, positioning it as a benchmark solution for real-world bird sound denoising applications. Source code is available at: https://github.com/aiai-4/ViVTS.

Paper Structure (17 sections, 14 equations, 2 figures, 2 tables, 1 algorithm)

Figures (2)

  • Figure 1: The overview of our ViTVS architecture. The encoder comprises a sequence of self-attention encoder blocks, each executing normalization, patch creation, and embedding layers. The decoder mirrors the encoder with additional operations, including unfolding and output projection, culminating in the final segmentation map. Both encoder and decoder consist of 12 blocks.
  • Figure 2: Comparison of segmentation results. The leftmost column is the original audio image. The ground truth is the labeled mask.
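
Comparing a predicted mask against the labeled ground-truth mask, as in Figure 2, is usually quantified with overlap metrics. The sketch below computes IoU, Dice, and F1 for a single pair of binary masks. It is an assumed, simplified formulation: the function name `mask_metrics` is hypothetical, and the paper may average over classes or images differently. For one binary mask, Dice and F1 are algebraically identical, so any difference between them in reported results would come from the averaging scheme.

```python
import numpy as np


def mask_metrics(pred, truth, eps=1e-12):
    """IoU, Dice, and F1 for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)

    tp = np.sum(pred & truth)    # pixels marked in both masks
    fp = np.sum(pred & ~truth)   # marked in prediction only
    fn = np.sum(~pred & truth)   # marked in ground truth only

    iou = tp / (tp + fp + fn + eps)
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"iou": float(iou), "dice": float(dice), "f1": float(f1)}


# Tiny example: 3 true positives, 1 false positive, 1 false negative.
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
pred = [[1, 1, 1, 0],
        [1, 0, 0, 0]]
scores = mask_metrics(pred, truth)
print(scores)
```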