Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks

Danae Sánchez Villegas; Daniel Preoţiuc-Pietro; Nikolaos Aletras

TL;DR

This paper tackles the challenge of cross-modal semantics in social media by fine-tuning multimodal models with two auxiliary losses: Image-Text Contrastive (ITC) and Image-Text Matching (ITM). ITC pulls the image and text representations of the same post closer together, while ITM strengthens the model's grasp of the semantic relationship between image and text, which helps when that relationship is implicit or ambiguous. Both are combined with the main-task loss $l_{CE}$ in a joint objective $l_{C+M} = \lambda_1 l_{CE} + \lambda_2 l_{ITC} + \lambda_3 l_{ITM}$. The authors evaluate the approach on five diverse English datasets and five base multimodal models, reporting improvements of up to 2.6 F1 points and analyzing when each auxiliary task is most beneficial. The approach is a practical enhancement to multimodal social media classification that requires no additional pre-training, with implications for sentiment, hate-speech, and sarcasm detection in real-world applications. Limitations include the English-only scope and increased training time; future work points to multilingual extension and broader model compatibility.
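
For reference, one common way to write the two auxiliary terms is sketched below. This is an assumption, not the paper's stated formulation: it uses a CLIP-style symmetric contrastive loss for ITC and a binary cross-entropy for ITM. The symbols $N$ (batch size), $v_i$ and $t_i$ (image and text embeddings of post $i$), $s(\cdot,\cdot)$ (a similarity such as cosine), $\tau$ (temperature), $y_i$ (1 if the pair is matched, 0 otherwise), and $p_i$ (predicted match probability) are introduced here for illustration only.

```latex
\begin{align}
l_{ITC} &= -\frac{1}{2N}\sum_{i=1}^{N}\left[
  \log\frac{\exp\left(s(v_i,t_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(s(v_i,t_j)/\tau\right)}
  + \log\frac{\exp\left(s(v_i,t_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(s(v_j,t_i)/\tau\right)}
\right] \\
l_{ITM} &= -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i\log p_i + (1-y_i)\log(1-p_i)\right] \\
l_{C+M} &= \lambda_1\, l_{CE} + \lambda_2\, l_{ITC} + \lambda_3\, l_{ITM}
\end{align}
```

Under this reading, setting $\lambda_2 = 0$ or $\lambda_3 = 0$ recovers a model trained with only one of the two auxiliary tasks.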

Abstract

Effectively leveraging multimodal information from social media posts is essential to various downstream tasks such as sentiment analysis, sarcasm detection or hate speech classification. Jointly modeling text and images is challenging because cross-modal semantics might be hidden or the relation between image and text is weak. However, prior work on multimodal classification of social media posts has not yet addressed these challenges. In this work, we present an extensive study on the effectiveness of using two auxiliary losses jointly with the main task during fine-tuning multimodal models. First, Image-Text Contrastive (ITC) is designed to minimize the distance between image-text representations within a post, thereby effectively bridging the gap between posts where the image plays an important role in conveying the post's meaning. Second, Image-Text Matching (ITM) enhances the model's ability to understand the semantic relationship between images and text, thus improving its capacity to handle ambiguous or loosely related modalities. We combine these objectives with five multimodal models across five diverse social media datasets, demonstrating consistent improvements of up to 2.6 points F1. Our comprehensive analysis shows the specific scenarios where each auxiliary task is most effective.
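
To make the combination of the main task with the two auxiliary losses concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes a CLIP-style symmetric contrastive loss for ITC and a two-way matched/mismatched classifier for ITM; the function names, the temperature of 0.07, and the default weights are all hypothetical choices made for illustration.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def itc_loss(img_embs, txt_embs, temperature=0.07):
    """Assumed ITC form: the i-th image and i-th text are the positive pair,
    every other pairing in the batch acts as a negative."""
    imgs = [normalize(v) for v in img_embs]
    txts = [normalize(v) for v in txt_embs]
    n = len(imgs)
    # Temperature-scaled cosine similarity between every image and every text.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


def itm_loss(match_logits, is_match):
    """Assumed ITM form: each item of match_logits is
    [mismatch_logit, match_logit]; is_match holds 0/1 labels."""
    losses = [cross_entropy(l, y) for l, y in zip(match_logits, is_match)]
    return sum(losses) / len(losses)


def joint_loss(l_ce, l_itc, l_itm, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum l_CE, l_ITC, l_ITM; the weights are hypothetical defaults."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_ce + lam2 * l_itc + lam3 * l_itm


if __name__ == "__main__":
    # Toy batch of two posts with 2-d embeddings (made-up numbers).
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[0.9, 0.1], [0.1, 0.9]]
    l_ce = cross_entropy([2.0, 0.5, -1.0], 0)   # main classification task
    l_itc = itc_loss(images, texts)
    l_itm = itm_loss([[-1.0, 2.0], [1.5, -0.5]], [1, 0])
    print(joint_loss(l_ce, l_itc, l_itm))
```

In this sketch the ITC term is small when each image is most similar to its own text and grows when texts are swapped across posts, which is the behavior the contrastive objective is meant to reward. How mismatched pairs for ITM are constructed is left outside the sketch, since the text above does not specify it.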

Paper Structure (39 sections, 1 equation, 4 figures, 3 tables)

Introduction
Multimodal Auxiliary Tasks
Image-Text Contrastive (ITC)
Image-Text Matching (ITM)
Joint Fine-tuning Objectives
Experimental Setup
Datasets
Single Modality Methods
Text-only
Image-only
Multimodal Models
Ber-ViT
MMBT
LXMERT
ViLT
...and 24 more sections

Figures (4)

Figure 1: Image-text relations in social media posts from Vempala and Preoţiuc-Pietro (2019) and corresponding image captions generated with InstructBLIP. While image captions have a clear visual-language connection, image-text relationships in social media posts may not be apparent.
Figure 2: Results in weighted F1 using Ber-ViT-Att (ATT) for all datasets when training with different percentages of training data. We plot the mean and standard deviation across three runs.
Figure 3: Accuracy per label using Ber-ViT-Att (ATT) across different image-text relation types based on image contribution to the post's meaning and text representation on the image.
Figure 4: Ber-ViT-Att (ATT) predictions on randomly selected examples with varying image-text relations.
