Zwitscherkasten -- DIY Audiovisual bird monitoring

Dominik Blum, Elias Häring, Fabian Jirges, Martin Schäffer, David Schick, Florian Schulenberg, Torsten Schön

TL;DR

Zwitscherkasten, a DIY multimodal system for bird species monitoring that uses audio and visual data on edge devices, shows that accurate bird species identification is feasible on embedded platforms, supporting scalable biodiversity monitoring and citizen science applications.

Abstract

This paper presents Zwitscherkasten, a DIY multimodal system for bird species monitoring using audio and visual data on edge devices. Deep learning models for bioacoustic and image-based classification are deployed on resource-constrained hardware, enabling real-time, non-invasive monitoring. An acoustic activity detector reduces energy consumption, while visual recognition is performed using fine-grained detection and classification pipelines. Results show that accurate bird species identification is feasible on embedded platforms, supporting scalable biodiversity monitoring and citizen science applications.

Paper Structure (38 sections, 2 equations, 7 figures, 3 tables)

This paper contains 38 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Illustration of the audio preprocessing pipeline. (top) Raw mono waveform decoded from the original recording (first 10 s shown). (bottom) Corresponding mel spectrogram after resampling to 32 kHz, log-scaled feature extraction, temporal standardization to 1000 frames, and PaSST-style normalization.
  • Figure 2: Class distribution of the dataset, including the 256-class cut-off.
  • Figure 3: Accuracy comparison for audio classification architectures on 256 bird species.
  • Figure 4: Illustration of the preprocessing pipeline using YOLOv11-based bird detection followed by image cropping. Four example images sourced from iNaturalist [inaturalist2024] are shown, demonstrating varying original image sizes and resulting crops: Buteo buteo (Observation ID: 325504686, Photo ID: 590714713, CC-BY), Erithacus rubecula (Observation ID: 331794775, Photo ID: 601986829, CC-BY), Falco tinnunculus (Observation ID: 322740349, Photo ID: 583899233, CC0), Passer domesticus (Observation ID: 333737125, Photo ID: 605910979, CC0). Bounding boxes and cropped images were generated by the authors.
  • Figure 5: Class distribution of the dataset after object detection and crop generation for the two-stage pipeline.
  • ...and 2 more figures
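
The last two steps named in the Figure 1 caption (temporal standardization to 1000 frames and PaSST-style normalization) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the choice to pad short clips with the spectrogram minimum, and the shift and scale constants (4.5 and 5.0, the fixed affine normalization commonly associated with the public PaSST preprocessing) are all assumptions. Resampling to 32 kHz and mel extraction are assumed to have happened upstream.

```python
import numpy as np

TARGET_FRAMES = 1000  # frame count taken from the Figure 1 caption


def standardize_frames(log_mel, target_frames=TARGET_FRAMES):
    """Crop or pad a (n_mels, n_frames) log-mel spectrogram along time.

    Assumption: short clips are right-padded with the spectrogram's
    minimum value; the paper does not state its padding strategy.
    """
    n_mels, n_frames = log_mel.shape
    if n_frames == 0:
        raise ValueError("spectrogram has no frames")
    if n_frames >= target_frames:
        # Keep only the first target_frames frames.
        return log_mel[:, :target_frames]
    pad = np.full(
        (n_mels, target_frames - n_frames),
        log_mel.min(),
        dtype=log_mel.dtype,
    )
    return np.concatenate([log_mel, pad], axis=1)


def passt_style_normalize(log_mel, shift=4.5, scale=5.0):
    """Fixed affine normalization of log-mel values.

    Assumption: shift and scale are illustrative defaults, not values
    reported in the paper.
    """
    return (log_mel + shift) / scale


def preprocess(log_mel):
    """Apply temporal standardization, then normalization."""
    return passt_style_normalize(standardize_frames(log_mel))
```

A fixed frame count lets clips of different lengths be batched into one tensor shape, which is what a fixed-input classifier on an edge device would typically require.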