MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning

Hang Zhao, Yifei Xin, Zhesong Yu, Bilei Zhu, Lu Lu, Zejun Ma

TL;DR

MINT tackles the modality gap in audio-language pre-training by freezing both the audio encoder and the language model and introducing Bridge-Net to align the modalities. It combines multi-target pre-training with instruction tuning, incorporating objectives for alignment, matching, and grounded generation, plus an instruction-aware feature extraction mechanism. Bridge-Net serves as a bottleneck that feeds task-relevant audio features to a frozen LLM, enabling robust zero-shot transfer to both discriminative and generative tasks. Empirical results show MINT achieves state-of-the-art or competitive performance across audio classification, retrieval, and captioning, with strong zero-shot generalization and without extensive task-specific fine-tuning.

Abstract

In the realm of audio-language pre-training (ALP), the challenge of achieving cross-modal alignment is significant. Moreover, the integration of audio inputs with diverse distributions and task variations poses challenges in developing generic audio-language models. In this study, we present MINT, a novel ALP framework boosting audio-language models through multi-target pre-training and instruction tuning. MINT leverages the strength of frozen pre-trained audio encoders and large language models (LLMs) to improve audio-language pre-training, enabling effective transferability to both audio-text understanding and generation tasks. To address the modality gap, we introduce Bridge-Net, a trainable module that enhances cross-modality alignment and the model's ability to follow instructions for a variety of audio-text tasks. Bridge-Net is pivotal within MINT, initially enhancing audio-language representation learning through a multi-target pre-training approach. Subsequently, Bridge-Net further boosts audio-to-language generative learning by integrating a frozen language model with instruction tuning. This integration empowers MINT to extract features in a flexible and effective manner, specifically tailored to the provided instructions for diverse tasks. Experimental results demonstrate that MINT attains superior performance across various audio-language understanding and generation tasks, highlighting its robust generalization capabilities even in zero-shot scenarios.
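
The alignment objective mentioned above is not specified in this summary. As a rough illustration only, the sketch below shows one common way such an objective is formulated: a symmetric contrastive (InfoNCE-style) loss over paired audio and text embeddings. The function names, the temperature value of 0.07, and the choice of this particular loss are assumptions made for illustration, not details confirmed by the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def contrastive_alignment_loss(audio_embs, text_embs, temperature=0.07):
    """Hypothetical symmetric contrastive loss (an assumption, not the
    paper's stated objective). Pair i of audio_embs and text_embs is
    treated as the positive; all other pairings in the batch are negatives.
    """
    audio = [l2_normalize(v) for v in audio_embs]
    text = [l2_normalize(v) for v in text_embs]
    n = len(audio)
    # Temperature-scaled cosine similarity between every audio/text pair.
    sim = [
        [sum(a * t for a, t in zip(audio[i], text[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text: each row should peak on its diagonal entry.
    a2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-audio: each column should peak on its diagonal entry.
    t2a = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (a2t + t2a)


# Toy batch: two audio clips and two captions as 2-D embeddings.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = contrastive_alignment_loss(audio_batch, matched_text)
loss_swapped = contrastive_alignment_loss(audio_batch, swapped_text)
```

In this toy batch, the correctly paired embeddings yield a much lower loss than the swapped ones, which is the behaviour an alignment objective is meant to reward. In a setup like the one described, only the bridging module would receive gradients from such a loss, since the audio encoder and the language model stay frozen.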

Paper Structure (14 sections, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Model architecture of MINT's audio-language representation learning.
  • Figure 2: MINT's instruction tuning process.