Turbo your multi-modal classification with contrastive learning

Zhiyu Zhang; Da Liu; Shengqiang Liu; Anna Wang; Jie Gao; Yali Li

TL;DR

Turbo addresses multi-modal classification by combining in-modal and cross-modal contrastive learning over dropout-induced dual representations. Each input pair is passed through the encoders twice with different dropout masks, which yields multiple in-modal and cross-modal objectives from a single pair; the resulting Turbo loss is trained jointly with supervised classification as an auxiliary loss. Experiments on two audio-text tasks, IEMOCAP SER (speech emotion recognition) and REJ DSD, show improved accuracy, including state-of-the-art performance on IEMOCAP, and analyses show better alignment and uniformity of the learned representations. The authors suggest that the approach generalizes better and could extend to larger-scale pretraining and additional modalities.

Abstract

Contrastive learning has become one of the most impressive approaches for multi-modal representation learning. However, previous multi-modal work has mainly focused on cross-modal understanding and ignored in-modal contrastive learning, which limits the representation of each modality. In this paper, we propose a novel contrastive learning strategy, called $Turbo$, to promote multi-modal understanding through joint in-modal and cross-modal contrastive learning. Specifically, multi-modal data pairs are sent through the forward pass twice with different hidden dropout masks to obtain two different representations for each modality. With these representations, we obtain multiple in-modal and cross-modal contrastive objectives for training. Finally, we combine the self-supervised Turbo with supervised multi-modal classification and demonstrate its effectiveness on two audio-text classification tasks, achieving state-of-the-art performance on a speech emotion recognition benchmark dataset.
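
The two-pass dropout idea described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the "encoder" here is just dropout applied to precomputed feature vectors, the loss is a standard InfoNCE with in-batch negatives, and the way the in-modal and cross-modal terms are summed is an assumption. The names `dropout`, `info_nce`, and `turbo_loss`, the dropout rate, and the temperature are all hypothetical. Audio and text vectors are assumed to already live in a shared space of equal dimension.

```python
import math
import random


def dropout(vec, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in vec]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] matches anchors[i]; other rows act as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, pos) / temperature for pos in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def turbo_loss(audio_feats, text_feats, p=0.1, temperature=0.1, seed=0):
    """Sketch of a joint in-modal + cross-modal contrastive loss (assumed combination)."""
    rng = random.Random(seed)
    # Two "forward passes" with independent dropout masks give two views per modality.
    a1 = [dropout(x, p, rng) for x in audio_feats]
    a2 = [dropout(x, p, rng) for x in audio_feats]
    t1 = [dropout(x, p, rng) for x in text_feats]
    t2 = [dropout(x, p, rng) for x in text_feats]
    # In-modal: the two views of the same sample are positives within each modality.
    in_modal = info_nce(a1, a2, temperature) + info_nce(t1, t2, temperature)
    # Cross-modal: every audio view is paired with every text view of the same sample.
    cross_modal = sum(
        info_nce(a, t, temperature) for a in (a1, a2) for t in (t1, t2)
    )
    return in_modal + cross_modal


# Toy batch of three audio-text pairs in a shared 4-dimensional space.
audio = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.5, 0.0], [0.5, 0.0, 1.0, 0.0]]
text = [[0.9, 0.1, 0.0, 0.4], [0.1, 0.8, 0.6, 0.0], [0.4, 0.0, 0.9, 0.1]]
print(turbo_loss(audio, text))
```

In a supervised setting, a value like this would be added to the classification loss as an auxiliary term; the weighting between the two is not given in this summary and would need to be taken from the paper.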

Paper Structure (12 sections, 6 equations, 3 figures, 2 tables)

This paper contains 12 sections, 6 equations, 3 figures, 2 tables.

Introduction
Method
Encoder with Dropout Mask
In-modal and Cross-modal Contrastive Learning
Supervised Classification with Turbo
Experiments
Datasets
Experimental Setup
Results
Analysis
Conclusions
Acknowledgements

Figures (3)

Figure 1: Overview of our proposed classification framework with Turbo
Figure 2: The $align$-$uniform$ plot of models
Figure 3: Feature distributions with Gaussian kernel density estimation (KDE) in $\mathbb{R}^{2}$.
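
For readers unfamiliar with the quantities plotted in Figure 2, the sketch below computes alignment and uniformity in their commonly used form (from Wang and Isola's analysis of contrastive representations): alignment is the mean distance between positive pairs, and uniformity is the log of the mean Gaussian potential over all pairs, both on L2-normalized features. Lower is better for both. Whether the paper uses exactly these settings (`alpha=2`, `t=2`) is an assumption, and the function names are illustrative.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0.0 else [v / norm for v in vec]


def alignment(xs, ys, alpha=2):
    """Mean distance (raised to alpha) between positive pairs xs[i] and ys[i]."""
    dists = [
        math.dist(l2_normalize(x), l2_normalize(y)) ** alpha
        for x, y in zip(xs, ys)
    ]
    return sum(dists) / len(dists)


def uniformity(xs, t=2):
    """Log of the mean Gaussian potential over all distinct pairs in xs."""
    zs = [l2_normalize(x) for x in xs]
    potentials = [
        math.exp(-t * math.dist(u, v) ** 2) for u, v in combinations(zs, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


# Points spread around the unit circle versus points collapsed onto one direction.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
print(alignment(spread, spread), uniformity(spread), uniformity(collapsed))
```

Collapsed features give a uniformity of zero, the worst possible value, while well-spread features give a negative value; perfectly matched positive pairs give an alignment of zero.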
