MAviS: A Multimodal Conversational Assistant For Avian Species

Yevheniia Kryklyvets, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jinxing Zhou, Fahad Shahbaz Khan, Rao Anwer, Salman Khan, Hisham Cholakkal

TL;DR

This paper introduces the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, and MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation.

Abstract

Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question-answer pairs. Building on the MAviS-Dataset, we introduce MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present MAviS-Bench, a benchmark of over 25,000 QA pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive multimodal LLMs for ecological applications.

Paper Structure

This paper contains 27 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Overview of our MAviS suite. The map illustrates the global distribution of over 1,000 bird species included in the proposed MAviS-Dataset, spanning 199 countries, with geographic coverage indicated by red triangles. Six annotated Q&A examples are displayed, each paired with image, audio, and text data, highlighting the multimodal and conversational nature. The bottom panel summarises key components of the dataset. MAviS-Dataset is organised into pretraining, instruction-tuning, and evaluation sets (MAviS-Bench), covering diverse recognition and reasoning question types related to visual attributes, audio-based emotions, habitat and food habits, offering a valuable foundation for developing MAviS-Chat, the proposed multimodal conversational assistant for avian species.
  • Figure 2: Pipeline for generating image-text and audio-text annotations in the pretraining dataset. The process combines large-scale public resources with structured AI-assisted enrichment to ensure semantic accuracy and species-specific grounding.
  • Figure 3: The curation process for the fine-tuning dataset, detailing the sources, annotation, and refinement steps to ensure high-quality alignment. Broader multimodal Q&A types are shown in Figure 1 and detailed in the section on the fine-tuning set.
  • Figure 4: Distribution of Image and Audio Samples in the Pretraining and Fine-Tuning Sets.
  • Figure 5: Additional multimodal question–answer samples from MAviS-Dataset, covering ecological inference such as eating habits, visual behaviour, and appearance interpretation, as well as detailed understanding of bird emotional tone and calling types from audio.
  • ...and 2 more figures