CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

Xianghu Yue, Xiaohai Tian, Lu Lu, Malu Zhang, Zhizheng Wu, Haizhou Li

TL;DR

CoAVT tackles the challenge of unified tri-modal learning by aligning audio, visual, and text through a cognition-inspired architecture that separates non-verbal and verbal processing yet couples them with a learnable query encoder. The model comprises a joint audio-visual encoder, a text encoder, and a query encoder that mediates cross-modal interactions, reinforced by AV-T, A-T, and V-T alignments and three losses: contrastive, matching, and language modeling. Empirically, CoAVT achieves state-of-the-art results on text-video retrieval (AudioCaps) in zero-shot and fine-tuned settings, as well as on audio-visual event classification and audio-visual retrieval on AudioSet and VGGSound, demonstrating strong cross-modal correlations and generalization. The approach highlights the value of cross-modal alignments and a bridging mechanism for robust multimodal representations, with potential for broader tri-modal tasks in multimedia understanding.

Abstract

There has been a long-standing quest for a unified audio-visual-text model that enables various multimodal understanding tasks by mimicking the listening, seeing and reading processes of human beings. Humans tend to represent knowledge using two separate systems: one for verbal (textual) information and one for non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, we introduce CoAVT -- a novel cognition-inspired Correlated Audio-Visual-Text pre-training model that connects the three modalities. It contains a joint audio-visual encoder that learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder that handles textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings and extracts the audiovisual features most informative for the corresponding text. Additionally, to leverage the correspondences of audio and vision with language respectively, we also establish audio-text and visual-text bi-modal alignments on top of the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: contrastive loss, matching loss and language modeling loss. Extensive experiments show that CoAVT learns strong multimodal correlations and generalizes to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on the audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
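
To make the contrastive objective over the three alignments (AV-T, A-T, V-T) more concrete, here is a minimal sketch in plain Python. It assumes a generic symmetric InfoNCE-style loss with a temperature of 0.07 and an unweighted sum over the three pairs; the abstract does not give the exact formulation, so the function names, the temperature and the equal weighting are illustrative assumptions rather than the authors' implementation. The matching and language modeling losses are not sketched.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; a zero vector is left unchanged.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(feats_a, feats_b, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    Row i of feats_a is the positive match for row i of feats_b;
    all other rows in the batch act as negatives.
    """
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    n = len(a)
    # Cosine similarities scaled by the (hypothetical) temperature.
    sim = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]
    a_to_b = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    b_to_a = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (a_to_b + b_to_a)

def pairwise_contrastive(av_feats, a_feats, v_feats, t_feats):
    # Unweighted sum over the AV-T, A-T and V-T pairs (weighting is an assumption).
    return (contrastive_loss(av_feats, t_feats)
            + contrastive_loss(a_feats, t_feats)
            + contrastive_loss(v_feats, t_feats))
```

With features that line up row by row, the loss is close to zero; shuffling the text rows so that positives no longer sit on the diagonal makes it grow, which is the behavior the alignment objective relies on.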

Paper Structure (32 sections, 7 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: The dual coding theory of human cognition proposed by Paivio [dual1].
  • Figure 2: Overview of the proposed CoAVT model, which consists of a joint audio-visual encoder, a text encoder, and a query encoder containing a set of learnable query embeddings. The query encoder shares its parameters with the text encoder, except for the cross-attention layers. The red dashed box shows the pre-training objectives of CoAVT, which are computed over three pairs: AV-T, A-T, and V-T. Each pair contributes a contrastive loss, a matching loss and a language modeling loss.
  • Figure 3: Qualitative results of video-to-text retrieval on AudioCaps.