N-Modal Contrastive Losses with Applications to Social Media Data in Trimodal Space

William Theisen, Walter Scheirer

TL;DR

This paper extends the contrastive loss function to any number of modalities, demonstrates its usefulness in trimodal spaces on social media, and presents a novel quadmodal CLIP model that can learn the interplay between text, image, video, and audio.

Abstract

The social media landscape of conflict dynamics has grown increasingly multi-modal. Recent advancements in model architectures such as CLIP have enabled researchers to begin studying the interplay between the modalities of text and images in a shared latent space. However, CLIP models cannot handle social media posts that contain more than two modalities. Social media dynamics often require understanding the interplay between not only text and images, but video as well. In this paper we explore an extension of the contrastive loss function to allow for any number of modalities, and demonstrate its usefulness in trimodal spaces on social media. By extending CLIP into three dimensions we can further aid the understanding of social media landscapes where all three modalities are present (an increasingly common situation). We use a newly collected public data set of Telegram posts containing all three modalities to train a trimodal model, and then demonstrate its usefulness in two OSINT scenarios: classifying a social media artifact as either pro-Russian or pro-Ukrainian, and identifying which account a given artifact originated from. While trimodal CLIP models have been explored before (though not on social media data), we also present a novel quadmodal CLIP model. This model can learn the interplay between text, image, video, and audio. We demonstrate new state-of-the-art baseline results on retrieval for quadmodal models moving forward.
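
To make the idea of a contrastive loss over more than two modalities concrete, the sketch below averages a symmetric CLIP-style (InfoNCE) loss over every pair of modalities. This pairwise decomposition is a simpler stand-in chosen for illustration and is not claimed to be the formulation used in the paper. The function names (`pairwise_infonce`, `n_modal_loss`) and the temperature value of 0.07 are assumptions.

```python
import itertools
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def pairwise_infonce(a, b, temperature=0.07):
    """Symmetric CLIP-style loss between two (batch, dim) embedding matrices.

    Row i of `a` and row i of `b` are assumed to come from the same post.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (batch, batch) similarity grid
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def n_modal_loss(embeddings, temperature=0.07):
    """Average the pairwise loss over all modality pairs (hypothetical helper).

    `embeddings` is a list of N arrays, one per modality, each (batch, dim).
    """
    pairs = list(itertools.combinations(embeddings, 2))
    return sum(pairwise_infonce(a, b, temperature) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(8, 16))
    image = text + 0.01 * rng.normal(size=(8, 16))
    video = text + 0.01 * rng.normal(size=(8, 16))
    print("aligned loss:", n_modal_loss([text, image, video]))
    print("misaligned loss:", n_modal_loss([text, image[::-1], video]))
```

With two modalities this reduces to the usual bimodal CLIP objective; with three or four it simply adds more pairs, so the batch rows must be aligned by post across every modality.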

Paper Structure

This paper contains 8 sections, 4 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: An intuitive visualization of contrastive loss expanded to a trimodal space. The optimization happens across a cube of similarities rather than a 2-dimensional grid, to account for the multi-modal properties of a social media post containing not only images and text, but video as well. After training a shared projection layer, embeddings from all modalities are projected into a shared latent space, with artifacts from the same post being close to each other.
  • Figure 2: The evaluation method for measuring the recall of our models. The chosen embedding is compared only to embeddings of a different modality, to highlight the cross-modal abilities of the model. Similarities for the top-K embeddings are then summed together when they are from the same post.
  • Figure 3: The receiver operating characteristic (ROC) curves for the stance classifiers on 10,000 posts, along with the area under the curve (AUC) for each classifier. All classifiers achieve results well above the baseline. The accuracy table on the right shows that Random Forests achieve the highest stance classification accuracy at 80.91%, though all methods other than Naive Bayes were within 1% of each other.
  • Figure 4: The ROC curves and AUC for the account classifiers, alongside the per-method classification accuracies. Random Forests achieved a 64.57% accuracy across 10,000 posts when using triCLIP-50k features. Much like with the binary classifier, Naive Bayes performed significantly worse than any of the other methods.
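
The evaluation procedure described in the Figure 2 caption can be sketched roughly as follows: a query embedding is scored against embeddings of other modalities only, the top-K matches are kept, and their similarities are summed per post. Choosing the post with the largest summed similarity as the prediction is an assumption made here for illustration, as are the function name `cross_modal_vote`, the use of cosine similarity, and the array layout.

```python
from collections import defaultdict

import numpy as np


def cross_modal_vote(query_idx, embeddings, modalities, post_ids, k=5):
    """Return the post id with the highest summed top-k cross-modal similarity.

    embeddings: (n, dim) array; modalities and post_ids: length-n arrays.
    Assumes at least one embedding has a different modality than the query.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb[query_idx]                      # cosine similarity to query

    # Keep only candidates whose modality differs from the query's modality.
    candidates = np.flatnonzero(modalities != modalities[query_idx])
    order = np.argsort(sims[candidates])[::-1][:k]   # top-k, highest first
    top = candidates[order]

    # Sum similarities of top-k hits that share a post id.
    scores = defaultdict(float)
    for i in top:
        scores[post_ids[i]] += float(sims[i])
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy data: 3 posts x 3 modalities, embeddings clustered by post.
    posts = np.repeat(np.arange(3), 3)
    mods = np.tile(np.arange(3), 3)
    emb = np.zeros((9, 4))
    emb[np.arange(9), posts] = 1.0
    emb[:, 3] = 0.1 * mods
    hits = [cross_modal_vote(i, emb, mods, posts, k=2) == posts[i] for i in range(9)]
    print("recall:", np.mean(hits))
```

Excluding same-modality candidates keeps a query from matching trivially similar items of its own type, so a hit reflects alignment across modalities.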