Turbo your multi-modal classification with contrastive learning
Zhiyu Zhang, Da Liu, Shengqiang Liu, Anna Wang, Jie Gao, Yali Li
TL;DR
Turbo targets multi-modal classification by combining in-modal and cross-modal contrastive learning over dropout-induced dual representations. A single input pair yields multiple in-modal and cross-modal contrastive objectives, and the resulting contrastive loss is trained jointly with supervised classification as an auxiliary loss. On two audio-text tasks, IEMOCAP SER and REJ DSD, the method improves accuracy and reaches state-of-the-art performance on IEMOCAP; further analyses show better alignment and uniformity of the learned representations. The approach is reported to generalize better and could be extended to larger-scale pretraining and additional modalities.
Abstract
Contrastive learning has become one of the most impressive approaches for multi-modal representation learning. However, previous multi-modal works have mainly focused on cross-modal understanding and ignored in-modal contrastive learning, which limits the representation of each modality. In this paper, we propose a novel contrastive learning strategy, called $Turbo$, to promote multi-modal understanding through joint in-modal and cross-modal contrastive learning. Specifically, each multi-modal data pair is sent through the forward pass twice with different hidden dropout masks, yielding two different representations for each modality. From these representations, we obtain multiple in-modal and cross-modal contrastive objectives for training. Finally, we combine the self-supervised Turbo with supervised multi-modal classification and demonstrate its effectiveness on two audio-text classification tasks, achieving state-of-the-art performance on a speech emotion recognition benchmark dataset.
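The two-pass dropout idea described in the abstract can be illustrated with a minimal, dependency-free sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy linear `encode` function, the InfoNCE form of the contrastive loss, the temperature, the way the in-modal and cross-modal terms are averaged, and the hypothetical `AUX_WEIGHT` used to mix the auxiliary loss with a placeholder classification loss.

```python
import math
import random


def encode(x, weights, drop_p, rng):
    """Toy linear encoder (assumption): project, apply dropout, L2-normalise."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    if drop_p > 0.0:
        # Inverted dropout; every call samples a fresh mask, so two passes
        # over the same input generally give two different representations.
        hidden = [0.0 if rng.random() < drop_p else h / (1.0 - drop_p) for h in hidden]
    norm = math.sqrt(sum(h * h for h in hidden)) or 1.0
    return [h / norm for h in hidden]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss (assumed form): anchors[i] should match positives[i];
    the other rows of `positives` act as in-batch negatives."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(u * v for u, v in zip(a, p)) / temperature for p in positives]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(anchors)


def turbo_style_losses(audio, text, w_audio, w_text, drop_p, rng):
    """Return (in_modal, cross_modal) losses from two dropout passes."""
    # Two forward passes per modality -> two views of every sample.
    a1 = [encode(x, w_audio, drop_p, rng) for x in audio]
    a2 = [encode(x, w_audio, drop_p, rng) for x in audio]
    t1 = [encode(x, w_text, drop_p, rng) for x in text]
    t2 = [encode(x, w_text, drop_p, rng) for x in text]
    # In-modal: the two views of the same sample are positives.
    in_modal = 0.5 * (info_nce(a1, a2) + info_nce(t1, t2))
    # Cross-modal: every audio view is paired with every text view,
    # in both directions, and the eight terms are averaged (assumption).
    terms = [info_nce(a, t) + info_nce(t, a) for a in (a1, a2) for t in (t1, t2)]
    cross_modal = sum(terms) / (2 * len(terms))
    return in_modal, cross_modal


# Usage with random toy data (all sizes and values are illustrative).
rng = random.Random(0)
dim_in, dim_hid, batch = 6, 8, 4
w_audio = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_hid)]
w_text = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_hid)]
audio = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(batch)]
text = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(batch)]

in_modal, cross_modal = turbo_style_losses(audio, text, w_audio, w_text, 0.1, rng)
AUX_WEIGHT = 0.1           # hypothetical mixing weight
classification_loss = 0.7  # placeholder for a supervised cross-entropy value
total_loss = classification_loss + AUX_WEIGHT * (in_modal + cross_modal)
```

In a real model the two views would come from calling the same network twice in training mode, so that its built-in dropout layers supply the differing masks; the explicit `rng` here only makes that mechanism visible.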
