Leveraging Habitat Information for Fine-grained Bird Identification

Tin Nguyen; Peijie Chen; Anh Totti Nguyen

TL;DR

This work demonstrates that habitat information, a core ornithological cue, can meaningfully improve fine-grained bird identification across both vision-only and vision-language models. By introducing habitat-aware data augmentation (Mixed-S, Mixed-G, Mixed-I) for CNNs/ViTs and habitat descriptions into CLIP prompts, the authors achieve robust, cross-dataset gains on NABirds, CUB-200, and iNaturalist-birds benchmarks. The results show consistent improvements in accuracy, especially under challenging conditions (background variation, occlusions, and small or partially visible birds), and reveal that both model families face similar habitat-related challenges. The study also highlights limitations due to habitat data availability and suggests future directions in part-based recognition and cross-region transferability to broaden applicability of habitat-informed bird identification systems.

Abstract

Traditional bird classifiers rely mostly on the visual characteristics of birds. Some prior works even train classifiers to be invariant to the background, completely discarding the birds' living environment. In contrast, we are the first to explore integrating habitat information, one of the four major cues ornithologists use to identify birds, into modern bird classifiers. We focus on two leading model types: (1) CNNs and ViTs trained on the downstream bird datasets; and (2) the original, multimodal CLIP. Training CNNs and ViTs with habitat-augmented data improves accuracy by up to +0.83 and +0.23 points on NABirds and CUB-200, respectively. Similarly, adding habitat descriptors to the CLIP prompts yields an accuracy boost of up to +0.99 and +1.1 points on NABirds and CUB-200, respectively. We find consistent accuracy improvements after integrating habitat features into the image augmentation process and into the textual descriptors of vision-language CLIP classifiers. Code is available at: https://anonymous.4open.science/r/reasoning-8B7E/.
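
To make the idea of habitat-augmented training data concrete, the following is a minimal sketch of how a segmented bird could be pasted onto a background drawn from different pools, loosely mirroring the Mixed-S, Mixed-G, and Mixed-I variants named in the TL;DR. Everything here is an assumption for illustration: the function names `composite` and `background_pool`, the `habitat_of` mapping, the availability of a bird mask, and the exact pool definitions are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def composite(bird, mask, background):
    """Paste the masked bird pixels onto a background of the same shape.

    bird, background: (H, W, 3) uint8 arrays; mask: (H, W) boolean array.
    """
    return np.where(mask[..., None], bird, background)

def background_pool(species, mode, habitat_of, backgrounds):
    """Collect candidate backgrounds for one species (assumed pool rules).

    mode "S": backgrounds taken from images of the same species.
    mode "G": backgrounds from other species that share the habitat label.
    mode "I": backgrounds from species with a different habitat label.
    """
    if mode == "S":
        donors = [species]
    elif mode == "G":
        donors = [s for s in backgrounds
                  if s != species and habitat_of[s] == habitat_of[species]]
    elif mode == "I":
        donors = [s for s in backgrounds
                  if habitat_of[s] != habitat_of[species]]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [bg for s in donors for bg in backgrounds[s]]

# Toy data: constant-colour 4x4 "backgrounds" stand in for real habitat images.
habitat_of = {"yellowthroat": "marsh", "wren": "marsh", "albatross": "shore"}
backgrounds = {
    "yellowthroat": [np.full((4, 4, 3), 10, dtype=np.uint8)],
    "wren": [np.full((4, 4, 3), 20, dtype=np.uint8)],
    "albatross": [np.full((4, 4, 3), 30, dtype=np.uint8)],
}

bird = np.full((4, 4, 3), 255, dtype=np.uint8)   # placeholder bird pixels
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                            # bird occupies the centre

pool_g = background_pool("yellowthroat", "G", habitat_of, backgrounds)
pool_i = background_pool("yellowthroat", "I", habitat_of, backgrounds)
augmented = composite(bird, mask, pool_g[0])
```

In a real pipeline the mask would come from a segmentation model or dataset annotations, and the background would be sampled at random from the chosen pool for each training image.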

Paper Structure (42 sections, 12 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Utilizing visual features in vision-only models
Adding extra information as another modality
Background Information
Habitat Classification
Methods
Improve Habitat Understanding in Vision-Only Models with Habitat-augmented Data
Mixed-Same (Mixed-S)
Mixed-Group (Mixed-G)
Mixed-Irrelevant (Mixed-I)
Habitat Understanding in Multimodal Models
Results
Dataset used in vision-only models
Dataset used in CLIP
...and 27 more sections

Figures (12)

Figure 1: Top: A CNN trained with habitat-augmented data improves accuracy by +0.83 pts over training on the original data. Bottom: Adding habitat descriptions to CLIP boosts zero-shot accuracy, exceeding visually-based and class-name-only descriptions by +0.99 pts and +1.90 pts, respectively (details in \ref{tab:combined_performance}, \ref{tab:multimodal_cub_nabirds}, and \ref{tab:multimodal_inat}). Both models are tested on NABirds.
Figure 2: Visual comparison of two bird pairs with similar morphologies but different habitats: Acadian Flycatcher in swamps vs. Least Flycatcher in woodland edges; Scott's Oriole in deserts vs. Evening Grosbeak in pine-oak areas. See Appendix \ref{sec:more_bird_pair} for details.
Figure 3: The three augmentation techniques (Mixed-S, Mixed-G, Mixed-I). Original bird images are in the first column, followed by their augmented versions. The first row shows a Common Yellowthroat in varying habitats (marsh to grassland). The second row shows it amidst different species sharing the same habitats. The last row demonstrates Mixed-Irrelevant, placing a Black-footed Albatross (typically found near shores) in forest and grass backgrounds.
Figure 4: Vision-only models (CNN, ViT) are trained on augmented datasets that blend original and habitat-augmented images. The augmented images carry more habitat context; for instance, Painted Bunting and Scott's Oriole appear in dense brush and arid foothills, respectively.
Figure 5: Integrating habitat descriptions into CLIP at zero-shot inference improves bird identification. Each class comes with a set of descriptions; CLIP computes the similarity between each description and the input image and averages these scores per class. The class with the highest softmax score is then predicted.
...and 7 more figures
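
The scoring rule described in the Figure 5 caption (average the image-description similarities per class, then take the class with the highest softmax score) can be sketched as follows. This is an illustrative sketch only: it uses small hand-written vectors in place of real CLIP embeddings, and the function name `classify`, the `temperature` value, and the class names are assumptions rather than the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def classify(image_emb, class_text_embs, temperature=100.0):
    """Predict a class from per-class sets of description embeddings.

    image_emb: (d,) image embedding.
    class_text_embs: dict mapping class name -> (n_descriptions, d) array.
    temperature: assumed logit scale applied before the softmax.
    """
    image_emb = l2_normalize(image_emb)
    names = list(class_text_embs)
    # Mean cosine similarity between the image and each class's descriptions.
    scores = np.array([
        (l2_normalize(class_text_embs[name]) @ image_emb).mean()
        for name in names
    ])
    logits = temperature * scores
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return names[int(np.argmax(probs))], dict(zip(names, probs))

# Toy embeddings: each row stands for one description (e.g. visual or habitat).
image_emb = np.array([1.0, 0.0, 0.0])
class_text_embs = {
    "class_a": np.array([[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]),
    "class_b": np.array([[0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]),
}

prediction, probabilities = classify(image_emb, class_text_embs)
```

With real CLIP, the image and description embeddings would come from the model's image and text encoders; adding a habitat descriptor simply adds one more row to a class's description matrix before averaging.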
