On the Challenges of Creating Datasets for Analyzing Commercial Sex Advertisements to Assess Human Trafficking Risk and Organized Activity

Pablo Rivas, Tomas Cerny, Alejandro Rodriguez Perez, Javier Turek, Laurie Giddens, Gisela Bichler, Stacie Petter

TL;DR

The paper addresses the challenge of building datasets for studying human trafficking risk and organized activity in online commercial sex advertisements, where data are scarce, become obsolete quickly, and raise privacy concerns. It proposes a reproducible, automated pipeline that scrapes and normalizes ads, deduplicates 5,053,249 ads down to 515,865 unique items, trains Longformer/XLNet-based named entity recognition (NER) models on 1,810 labeled ads, and constructs a Relatedness Graph to support pseudo-labeling for organized activity detection (OAD) and human trafficking risk prediction (HTRP), using a Levenshtein threshold of $0.5$. The resulting graph is sparsely connected yet contains large components, which reveals non-trivial associations between ads while exposing dataset biases and limitations. The work contributes a privacy-conscious, scalable methodology and a reproducible protocol for constructing datasets in sensitive domains without releasing raw data. The work was funded by the NSF.
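As a rough illustration of what a Levenshtein threshold of $0.5$ could mean in practice, the sketch below normalizes the edit distance by the length of the longer string and treats two strings as a match when the result is at most the threshold. This summary does not say how the paper normalizes the distance or which fields it compares, so the normalization, the direction of the comparison, and the names `normalized_distance` and `is_match` are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


# Assumption: a pair matches when the normalized distance is at most 0.5.
THRESHOLD = 0.5


def is_match(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    return normalized_distance(a, b) <= threshold
```

Under these assumptions, `is_match("kitten", "sitting")` holds because the edit distance is 3 over a longest length of 7, while two strings with no characters in common at any position, such as `"abc"` and `"xyz"`, do not match.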

Abstract

Our study addresses the challenges of building datasets to understand the risks associated with organized activities and human trafficking through commercial sex advertisements. These challenges include data scarcity, rapid obsolescence, and privacy concerns. Traditional approaches, which are not automated and are difficult to reproduce, fall short in addressing these issues. We have developed a reproducible and automated methodology to analyze five million advertisements. In the process, we identified further challenges in dataset creation within this sensitive domain. This paper presents a streamlined methodology to assist researchers in constructing effective datasets for combating organized crime, allowing them to focus on advancing detection technologies.

Paper Structure (7 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Methodology to generate a pseudo-labeled dataset in human trafficking risk prediction and organized activity detection tasks.
  • Figure 2: Processing the text of an ad with the NER pipeline. Personally identifiable data has been changed.
  • Figure 3: Ads connected indirectly in a connected component. (a) Description text of the highlighted posts. Personally identifiable data has been changed. (b) Connected component graph where referred posts are in orange.
  • Figure 4: A connected component labeled positive due to several phone numbers found. Each color represents a different phone number encountered.
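The idea behind Figures 3 and 4 can be sketched in a few lines: ads that share an extracted entity are linked, ads linked through intermediaries fall into the same connected component, and a component is flagged when it contains several distinct phone numbers. The code below is a minimal sketch under stated assumptions, not the authors' implementation. The entity encoding as `(kind, value)` pairs, the union-find approach, the function names, and the `min_phones=2` cutoff are all hypothetical, and the sample ads are invented placeholders.

```python
from collections import defaultdict

# Hypothetical sample: ad id -> set of (entity kind, value) pairs from NER.
ADS = {
    "a": {("phone", "555-0100"), ("name", "x")},
    "b": {("name", "x"), ("phone", "555-0101")},
    "c": {("phone", "555-0101")},
    "d": {("phone", "555-0199")},
}


def connected_components(ads):
    """Group ads that share an entity directly or through other ads."""
    parent = {ad_id: ad_id for ad_id in ads}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(left, right):
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    first_seen = {}  # entity -> first ad that mentioned it
    for ad_id, entities in ads.items():
        for entity in entities:
            if entity in first_seen:
                union(first_seen[entity], ad_id)
            else:
                first_seen[entity] = ad_id

    groups = defaultdict(set)
    for ad_id in ads:
        groups[find(ad_id)].add(ad_id)
    return list(groups.values())


def pseudo_label(component, ads, min_phones=2):
    """Flag a component holding at least min_phones distinct phone numbers.

    The cutoff of 2 is an assumption; the source only says "several".
    """
    phones = {
        value
        for ad_id in component
        for kind, value in ads[ad_id]
        if kind == "phone"
    }
    return len(phones) >= min_phones


for component in connected_components(ADS):
    print(sorted(component), pseudo_label(component, ADS))
```

In this sample, ads `a` and `c` share no entity directly but end up in the same component through `b`, which mirrors the indirect connection shown in Figure 3. That component contains two distinct phone numbers and is flagged, while the isolated ad `d` is not.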