Model Merging Improves Zero-Shot Generalization in Bioacoustic Foundation Models

Davide Marincione, Donato Crisostomi, Roberto Dessi, Emanuele Rodolà, Emanuele Rossi

TL;DR

Bioacoustic foundation models such as NatureLM-audio excel at zero-shot tasks, but domain-specific fine-tuning costs them instruction-following flexibility. The authors propose a lightweight model-merging approach that linearly interpolates the fine-tuned model with its base language model; this recovers instruction-following with minimal loss of bioacoustic expertise. The merged model achieves over 200% relative improvement on zero-shot classification of unseen species, setting a new state of the art for closed-set zero-shot classification. The work offers a practical way to balance domain adaptation against general instruction-following, with a merging coefficient α of roughly 0.4–0.6 serving as a robust default.
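
To make the headline figure concrete, here is the usual definition of relative improvement. This summary does not spell out the definition, so treat it as an assumption; the symbols $a_{\text{merged}}$ and $a_{\text{NatureLM}}$ (accuracy of the merged model and of NatureLM-audio on the same benchmark) are introduced here for illustration only.

$$
\text{relative improvement} = \frac{a_{\text{merged}} - a_{\text{NatureLM}}}{a_{\text{NatureLM}}}
$$

Under this definition, a relative improvement above 200% means $a_{\text{merged}} > 3\,a_{\text{NatureLM}}$, i.e. the merged model's accuracy is more than three times the baseline's.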

Abstract

Foundation models capable of generalizing across species and tasks represent a promising new frontier in bioacoustics, with NatureLM being one of the most prominent examples. While its domain-specific fine-tuning yields strong performance on bioacoustic benchmarks, we observe that it also introduces trade-offs in instruction-following flexibility. For instance, NatureLM achieves high accuracy when prompted for either the common or scientific name individually, but its accuracy drops significantly when both are requested in a single prompt. We address this by applying a simple model merging strategy that interpolates NatureLM with its base language model, recovering instruction-following capabilities with minimal loss of domain expertise. Finally, we show that the merged model exhibits markedly stronger zero-shot generalization, achieving over a 200% relative improvement and setting a new state-of-the-art in closed-set zero-shot classification of unseen species.
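
The interpolation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes both models expose the same named parameters, represents each parameter as a flat list of floats, and adopts the convention that α = 0 returns the base model and α = 1 returns the fine-tuned one (the summary does not state which direction the paper uses). The function name `interpolate_weights` and the toy parameter names are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(base: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * base + alpha * finetuned, parameter by parameter.

    Assumed convention: alpha = 0 gives the base model,
    alpha = 1 gives the fine-tuned model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != finetuned.keys():
        raise ValueError("models must share the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        tuned_values = finetuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two checkpoints.
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged


# Toy example with made-up numbers (not taken from the paper).
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
finetuned = {"layer.weight": [4.0, 6.0], "layer.bias": [3.0]}
merged = interpolate_weights(base, finetuned, alpha=0.5)
```

With real checkpoints the same loop would run over tensors rather than lists, and only over the parameters the two models share; sweeping `alpha` then traces the trade-off between instruction-following and domain expertise.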

Paper Structure

This paper contains 24 sections, 1 equation, 10 figures.

Figures (10)

  • Figure 1: Model merging leads to a 200% relative improvement over NatureLM-audio in zero-shot classification of unseen species, setting a new state-of-the-art.
  • Figure 2: NatureLM-audio classification accuracy for different prompts on Watkins and CBI.
  • Figure 3: Example model predictions for the common name, scientific name, and combined-name prompts, compared to ground truth. Correct predictions in green, incorrect in red.
  • Figure 4: Accuracy on the combined-name prompt (y-axis) versus the mean accuracy on the individual common- and scientific-name prompts (x-axis) when varying the merging coefficient $\alpha$.
  • Figure 5: Error breakdown (lower is better) on the binary subset of unseen-family-cmn as a function of the merging coefficient $\alpha$.
  • ...and 5 more figures