MATCHED: Multimodal Authorship-Attribution To Combat Human Trafficking in Escort-Advertisement Data

Vageesh Saxena, Benjamin Bashpole, Gijs Van Dijck, Gerasimos Spanakis

TL;DR

MATCHED introduces a multimodal authorship-attribution (AA) framework to combat human trafficking in escort-advertisement data by linking the text descriptions and images of ads. It provides the new MATCHED dataset (27,619 unique texts, 55,115 unique images, 3,549 vendors across 7 U.S. cities in 4 regions) and benchmarks text-only, vision-only, and multimodal models for vendor identification and verification, using multitask learning with a joint cross-entropy and supervised contrastive (CE+SupCon) objective. Key findings show that multimodal approaches outperform unimodal baselines, with end-to-end training (DeCLUTR-ViT with CE+SupCon) delivering the strongest results, while cross-modal alignment strategies like CLIP and BLIP2 struggle because the text and images of escort ads have little semantic overlap. The work demonstrates the practical potential of multimodal AA for law enforcement agencies (LEAs), while emphasizing domain-specific adaptations, privacy-preserving data handling, and the need for careful generalization to new platforms and regions.

Abstract

Human trafficking (HT) remains a critical issue, with traffickers increasingly leveraging online escort advertisements (ads) to advertise victims anonymously. Existing detection methods, including Authorship Attribution (AA), often center on text-based analyses and neglect the multimodal nature of online escort ads, which typically pair text with images. To address this gap, we introduce MATCHED, a multimodal dataset of 27,619 unique text descriptions and 55,115 unique images collected from the Backpage escort platform across seven U.S. cities in four geographical regions. Our study extensively benchmarks text-only, vision-only, and multimodal baselines for vendor identification and verification tasks, employing multitask (joint) training objectives that achieve superior classification and retrieval performance on in-distribution and out-of-distribution (OOD) datasets. Integrating multimodal features further enhances this performance, capturing complementary patterns across text and images. While text remains the dominant modality, visual data adds stylistic cues that enrich model performance. Moreover, text-image alignment strategies like CLIP and BLIP2 struggle due to low semantic overlap and vague connections between the modalities of escort ads, with end-to-end multimodal training proving more robust. Our findings emphasize the potential of multimodal AA (MAA) to combat HT, providing LEAs with robust tools to link ads and disrupt trafficking networks.
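
The multitask (joint) objective mentioned above pairs a classification loss with a contrastive one. The following is a minimal sketch of how a CE+SupCon loss could be combined for one batch, written in plain Python for readability. The function names, the `alpha` weighting, and the `temperature` default are illustrative assumptions, not the authors' implementation; this summary does not state how the paper weights the two terms or which layer supplies the embeddings.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one batch.

    Each anchor is pulled toward the other batch items with the same
    label (here: the same vendor) and pushed away from the rest.
    Anchors with no same-label partner in the batch are skipped.
    """
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other item.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def joint_loss(logits_batch, embeddings, labels, alpha=1.0, temperature=0.1):
    """Mean CE plus alpha * SupCon; alpha is an assumed weighting."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits_batch, labels))
    ce /= len(labels)
    return ce + alpha * supcon_loss(embeddings, labels, temperature)
```

In this sketch, `logits_batch` would come from a vendor-classification head and `embeddings` from the shared backbone, so a single backward pass could train both the classifier (identification) and the embedding space used for retrieval (verification). A real training loop would compute the same quantities with a tensor library rather than Python lists.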

Paper Structure

This paper contains 100 sections, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Comparison of retrieval performance across multiple baselines for text-to-text, image-to-image, and multimodal ads retrieval tasks on South, Midwest, West, and Northeast datasets. The text-to-text retrieval baselines include the pre-trained DeCLUTR checkpoint (●), DeCLUTR classifiers trained on CE (■) and CE+SupCon losses (■), and the DeCLUTR backbone trained with SupCon loss (■). Image-to-image retrieval baselines include the pre-trained ViT checkpoint (●), ViT classifiers trained on CE (■), CE+Triplet (■), and CE+SupCon losses (■), and ViT backbones trained with SupCon (■) and Triplet (■) losses. Multimodal baselines include End2End DeCLUTR-ViT classifiers trained with CE and CE+SupCon objectives, and BLIP2-aligned DeCLUTR-ViT classifiers trained with the CE+SupCon objective.
  • Figure 2:
  • Figure 3: Comparison of model performance among text-only, vision-only, and multimodal classifiers trained on the South region test dataset: (i) F1 score across different vendor IDs, (ii) Average F1 score for vendors with varying ad frequencies, (iii) Analysis of true and false positives, (iv) Average F1 score relative to the number of escort names (potentially representing different individuals) in vendor ads, and (v, vi) Average F1 score based on the number of vendor images with and without faces.
  • Figure 4: Comparison of retrieval performance on the South region test datasets. Text, vision, and multimodal baselines (DeCLUTR-small, ViT-base-patch16-224, and DeCLUTR-ViT, respectively) are trained end-to-end for vendor identification using the joint CE+SupCon objective on the South region dataset. M-Text and M-Vision represent text-only and image-only embeddings from the multimodal system. Vision-Face and Multimodal-Face denote evaluations of escort images with and without faces.
  • Figure 5: Comparison of retrieval performance on the Midwest region test datasets. Text, vision, and multimodal baselines (DeCLUTR-small, ViT-base-patch16-224, and DeCLUTR-ViT, respectively) are trained end-to-end for vendor identification using the joint CE+SupCon objective on the South region dataset. M-Text and M-Vision represent text-only and image-only embeddings from the multimodal system. Vision-Face and Multimodal-Face denote evaluations of escort images with and without faces.
  • ...and 3 more figures