Two-step Automated Cybercrime Coded Word Detection using Multi-level Representation Learning

Yongyeon Kim; Byung-Won On; Ingyu Lee

TL;DR

The paper tackles automatic detection and interpretation of cybercrime coded words (C3 words) in social media data when labeled data are scarce. It introduces a two-step framework that first derives a mean latent vector $\bar{v}_c$ for each cybercrime type $c$ using one of five AutoEncoder models, then detects C3 words from multi-level latent representations by comparing sentence- and word-level embeddings against the crime-type means under a threshold $\theta$. It further provides three analytical methods (outlier-based discovery of new terms, cross-crime word overlap, and automatic taxonomy generation) to deepen understanding of drug- and sex-crime vocabularies. Empirical results show the SAE-based two-step approach reaching a top F1 of $0.991$, outperforming dark-GloVe and dark-BERT, which suggests practical potential for rapid, semi-supervised C3 word analysis and taxonomy construction. The work highlights implications for law enforcement analytics, notes data size and regional biases as limitations, and emphasizes ethical considerations for deployment.

Abstract

In social network service platforms, crime suspects are likely to communicate using cybercrime coded words, either by attaching criminal meanings to existing words or by replacing them with similar words. For instance, the word 'ice' is often used to mean methamphetamine in drug crimes. To analyze the nature of cybercrime and the behavior of criminals, it is critical to detect such words quickly and to understand their meaning. Automated cybercrime coded word detection is difficult for two reasons: it is hard to collect enough training data for supervised learning, and language models that rely on context information to understand natural language cannot be applied directly. To overcome these limitations, we propose a new two-step approach, in which a mean latent vector is constructed for each cybercrime through one of five different AutoEncoder models in the first step, and cybercrime coded words are detected based on multi-level latent representations in the second step. Moreover, to understand the cybercrime coded words detected through the two-step approach more deeply, we propose three novel methods: (1) detection of recently coined new words, (2) detection of words that frequently appear in both drug and sex crimes, and (3) automatic generation of a word taxonomy. According to our experimental results, the stacked AutoEncoder model shows the best performance among the AutoEncoder models. Additionally, the F1-score of the two-step approach is 0.991, which is higher than the 0.987 and 0.903 of the existing dark-GloVe and dark-BERT models, respectively. By analyzing the experimental results of the three proposed methods, we can gain a deeper understanding of drug and sex crimes.
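The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-dimensional vectors stand in for AutoEncoder latent vectors, cosine similarity is an assumed comparison measure (the abstract does not name one), a single threshold `theta` is assumed for both the sentence and word levels, and the function names `mean_vector`, `cosine`, and `detect_coded_words` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Step 1 (sketch): element-wise mean of the latent vectors of one crime type."""
    if not vectors:
        raise ValueError("need at least one latent vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; an assumed measure, not one confirmed by the paper."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_coded_words(
    sentence_vec: Vector,
    word_vecs: Dict[str, Vector],
    crime_means: Dict[str, Vector],
    theta: float,
) -> Tuple[Optional[str], List[str]]:
    """Step 2 (sketch): sentence-level check first, then word-level check."""
    # Sentence level: find the crime type whose mean vector is most similar.
    best = max(crime_means, key=lambda c: cosine(sentence_vec, crime_means[c]))
    if cosine(sentence_vec, crime_means[best]) < theta:
        return None, []  # sentence is not close enough to any crime type
    # Word level: flag words whose vectors are close to that crime-type mean.
    flagged = [w for w, v in word_vecs.items()
               if cosine(v, crime_means[best]) >= theta]
    return best, flagged


# Toy latent vectors (made up for illustration only).
crime_means = {
    "drug": mean_vector([[1.0, 0.0], [0.8, 0.2]]),
    "sex": mean_vector([[0.0, 1.0], [0.2, 0.8]]),
}
word_vecs = {"ice": [1.0, 0.1], "buy": [0.5, 0.5]}
result = detect_coded_words([0.9, 0.2], word_vecs, crime_means, theta=0.9)
print(result)  # ('drug', ['ice'])
```

In this toy run the sentence vector lies close to the drug mean, so only words whose vectors are also close to that mean are flagged. In practice the threshold would need tuning on held-out data, and the sentence and word levels may warrant separate thresholds.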

Paper Structure (19 sections, 11 equations, 7 figures, 11 tables)


Introduction
Related Works
Proposed Model
Step 1: Construction of Mean Latent Vector by Type of Cybercrime
Stacked AutoEncoder (SAE)
Denoising AutoEncoder (DAE)
Stacked Denoising AutoEncoder (SDAE)
Variational AutoEncoder (VAE)
Adversarial AutoEncoder (AAE)
Step 2: Multi-level Latent Representations-based Cybercrime Coded Word Detection
Analysis of C3 Words
Detection of New C3 Words
Detection of C3 Words across Two Cybercrimes
Taxonomy of C3 Words
Experimental Results
...and 4 more sections

Figures (7)

Figure 1: Overview of the proposed two-step approach.
Figure 2: The proposed AutoEncoder model based on Bi-LSTM.
Figure 3: t-SNE visualization of latent vectors for C3 words detected through five AutoEncoder models.
Figure 4: C3 words related to drug and sex crimes.
Figure 5: Categories of drug (top) and sex crime (bottom) related words.
...and 2 more figures
