ASGIR: Audio Spectrogram Transformer Guided Classification And Information Retrieval For Birds

Yashwardhan Chaudhuri, Paridhi Mundra, Arnesh Batra, Orchid Chetia Phukan, Arun Balaji Buduru

TL;DR

ASGIR tackles robust bird sound recognition and information retrieval by integrating an Audio Spectrogram Transformer (AST) with a downstream linear SVM classifier and a location-aware web-scraping module that gathers habitat and characteristic data from Wikipedia. The two-step IR system first narrows the search space using geographic context and then retrieves species-specific information for recognized birds, enabling contextual knowledge alongside identification. Evaluation on a 51-class European subset from Xeno-Canto shows strong class-wise performance, with ablations indicating AST+SVM as the most effective pair and location-based narrowing providing incremental accuracy gains. The approach offers a practical pipeline for ecological monitoring and ornithological research, complemented by an accessible codebase for replication and extension.
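The location-based narrowing step can be illustrated with a minimal sketch. It assumes the classifier exposes one score per species (for example, SVM decision values) and that a lookup table maps each location to its plausible species. The function name `narrow_and_predict`, the region names, and the scores below are all hypothetical and are not taken from the ASGIR codebase.

```python
from typing import Dict, Optional, Set

# Hypothetical lookup table: species considered plausible in each region.
SPECIES_BY_LOCATION: Dict[str, Set[str]] = {
    "region_a": {"Erithacus rubecula", "Parus major"},
    "region_b": {"Turdus merula", "Parus major"},
}


def narrow_and_predict(
    scores: Dict[str, float],
    location: Optional[str],
    species_by_location: Dict[str, Set[str]] = SPECIES_BY_LOCATION,
) -> str:
    """Return the best-scoring species among those plausible at `location`."""
    candidates = species_by_location.get(location) if location is not None else None
    if candidates is None:
        # Unknown or missing location: keep the full label set.
        allowed = dict(scores)
    else:
        allowed = {name: s for name, s in scores.items() if name in candidates}
    if not allowed:
        # No overlap between the candidates and the scored classes: fall back.
        allowed = dict(scores)
    return max(allowed, key=allowed.get)


# Made-up classifier scores for a single recording.
scores = {"Turdus merula": 0.9, "Erithacus rubecula": 0.7, "Parus major": 0.2}
print(narrow_and_predict(scores, None))
print(narrow_and_predict(scores, "region_a"))
```

In this toy case the unrestricted prediction is the top-scoring class overall, while restricting to `region_a` removes that class and promotes the runner-up. This is one way narrowing can change a prediction; whether it helps depends on how reliable the location table is.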

Abstract

Recognition and interpretation of bird vocalizations are pivotal in ornithological research and ecological conservation because they help in understanding avian behaviour, assessing habitats and judging ecological health. This paper presents ASGIR, an audio spectrogram-guided classification framework for improved bird sound recognition and information retrieval. Our work is accompanied by a simple-to-use, two-step information retrieval system that uses geographical location and bird sounds to localize recognized birds and retrieve relevant information about them by scraping their Wikipedia pages. ASGIR achieves strong performance on a random subset of 51 classes of bird sounds from European countries in the Xeno-Canto dataset, with a median of 100% on the F1, Precision and Sensitivity metrics. Our code is available at https://github.com/MainSample1234/AS-GIR.
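To make the reported statistic concrete, the sketch below computes per-class Precision, Sensitivity (recall) and F1 and then takes the median over classes. It assumes the median is taken across classes, which the class-wise framing suggests but the abstract does not state outright. The labels are toy data invented for illustration, not results from the paper.

```python
from statistics import mean, median


def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, sensitivity, f1)} from two label lists."""
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        metrics[c] = (precision, sensitivity, f1)
    return metrics


# Toy labels: five classes, two recordings each, one misclassification (d -> c).
y_true = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"]
y_pred = ["a", "a", "b", "b", "c", "c", "d", "c", "e", "e"]

metrics = per_class_metrics(y_true, y_pred)
f1_scores = [m[2] for m in metrics.values()]
print("median F1:", median(f1_scores))
print("mean F1:", round(mean(f1_scores), 4))
```

Here three of the five classes are perfect, so the median F1 is 1.0 even though the mean is lower. A median of 100% therefore says that at least half of the classes are classified perfectly, not that every class is.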

Paper Structure (4 sections, 1 figure, 2 tables)

This paper contains 4 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: ASGIR Workflow: The user starts by entering (1) an audio recording and (2) the location of the recording, chosen from a drop-down menu. The model narrows the search space based on the location of the recording and then runs the ASGIR audio classifier to detect bird names. We use the detected bird name to scrape information about the bird's habitat and characteristics by parsing HTML tags on its Wikipedia page.
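The final scraping step of the workflow can be sketched with Python's standard `html.parser` module. This is an illustration only: the caption does not say which tags or sections the ASGIR scraper reads, so the `SectionExtractor` class, the `extract_section` helper, and the inline HTML snippet are assumptions. Real Wikipedia markup is more complex, and a real run would first download the page.

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collect <p> text under h2/h3 headings whose title contains a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.in_heading = False
        self.in_paragraph = False
        self.in_target = False
        self.heading_text = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self.in_heading = True
            self.heading_text = []
        elif tag == "p" and self.in_target:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self.in_heading = False
            # A new heading starts a new section; check whether it matches.
            self.in_target = self.keyword in "".join(self.heading_text).lower()
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text.append(data)
        elif self.in_paragraph:
            self.paragraphs[-1] += data


def extract_section(html, keyword):
    """Return the paragraphs found under headings matching `keyword`."""
    parser = SectionExtractor(keyword)
    parser.feed(html)
    return parser.paragraphs


# Made-up HTML standing in for a downloaded species page.
SAMPLE_HTML = """
<h2>Description</h2><p>A small bird.</p>
<h2>Distribution and habitat</h2>
<p>Found in <a href="#">woodland</a> and gardens.</p>
<h2>Behaviour</h2><p>Sings at dawn.</p>
"""

print(extract_section(SAMPLE_HTML, "habitat"))
```

Matching on heading text rather than on fixed element positions keeps the sketch tolerant of pages whose sections appear in a different order.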