Impromptu Cybercrime Euphemism Detection

Xiang Li, Yucheng Zhou, Laiping Zhao, Jing Li, Fangming Liu

TL;DR

This work tackles the challenge of detecting impromptu cybercrime euphemisms in online text by introducing the ICED dataset and a two-stage CAMIT framework that combines coarse-grained filtering with a fine-grained, context-aware detector. CAMIT leverages context augmentation and multi-round iterative training, complemented by LLM-guided dev-set generation for robust evaluation. Key contributions include constructing a realistic, three-part ICED corpus, a Word2Vec-based candidate masking strategy, and a joint MLM+CAM objective that enhances contextual understanding of euphemisms. Empirically, CAMIT achieves a 76-fold improvement over prior state-of-the-art euphemism detectors, with ablations confirming the importance of context modeling, iterative training, and backbone choice for performance gains.

Abstract

Detecting euphemisms is essential for content security on social media platforms, but existing euphemism detection methods are ineffective on impromptu euphemisms. In this work, we make a first attempt at exploring impromptu euphemism detection and introduce the Impromptu Cybercrime Euphemisms Detection (ICED) dataset. Moreover, we propose a detection framework tailored to this problem, which employs context augmentation modeling and multi-round iterative training. Our detection framework mainly consists of a coarse-grained and a fine-grained classification model. The coarse-grained classification model removes most of the harmless content from the corpus to be detected. The fine-grained model, the impromptu euphemism detector, integrates context augmentation and multi-round iterative training to better predict the actual meaning of a masked token. In addition, we leverage ChatGPT to evaluate the model's capability. Experimental results demonstrate that our approach achieves a remarkable 76-fold improvement compared to the previous state-of-the-art euphemism detector.
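
The two-stage flow described in the abstract (a coarse filter that discards harmless text, followed by a fine-grained detector that predicts the meaning of a masked token) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`coarse_filter`, `detect_in_sentence`, `run_pipeline`, the toy classifier and toy predictor, the candidate and target sets) is a hypothetical stand-in, and the trained models are replaced by hand-written rules so the control flow can be run without any dependencies.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

MASK = "[MASK]"


def coarse_filter(sentences: Sequence[str],
                  is_suspicious: Callable[[str], bool]) -> List[str]:
    """Stage 1: keep only sentences the coarse classifier flags."""
    return [s for s in sentences if is_suspicious(s)]


def detect_in_sentence(sentence: str,
                       candidates: Set[str],
                       predict_masked: Callable[[List[str], int], str],
                       target_terms: Set[str]) -> List[Tuple[str, str]]:
    """Stage 2: mask each candidate token and ask the predictor what the
    blank means; report the token if the meaning is a target term."""
    tokens = sentence.lower().split()
    hits = []
    for i, token in enumerate(tokens):
        if token not in candidates:
            continue
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        meaning = predict_masked(masked, i)
        if meaning in target_terms:
            hits.append((token, meaning))
    return hits


def run_pipeline(sentences: Sequence[str],
                 is_suspicious: Callable[[str], bool],
                 candidates: Set[str],
                 predict_masked: Callable[[List[str], int], str],
                 target_terms: Set[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Run the coarse filter, then the fine-grained detector."""
    results = {}
    for sentence in coarse_filter(sentences, is_suspicious):
        hits = detect_in_sentence(sentence, candidates,
                                  predict_masked, target_terms)
        if hits:
            results[sentence] = hits
    return results


# Toy stand-ins (assumptions): a real system would use trained models here.
def toy_is_suspicious(sentence: str) -> bool:
    return any(cue in sentence.lower().split() for cue in ("selling", "cheap"))


def toy_predict_masked(masked_tokens: List[str], position: int) -> str:
    # Pretend the context word "logins" pushes the prediction to "credentials".
    return "credentials" if "logins" in masked_tokens else "food"


corpus = [
    "we had pizza for lunch today",
    "selling fresh pizza with working logins",
    "selling fresh pizza at the market",
]
found = run_pipeline(corpus, toy_is_suspicious, {"pizza"},
                     toy_predict_masked, {"credentials"})
print(found)
```

In this toy run the first sentence is dropped by the coarse stage, and the third passes the filter but is not flagged because the surrounding context does not point to a target meaning. The same surface word is thus treated differently depending on context, which is the behaviour the fine-grained stage is meant to capture.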

Paper Structure

This paper contains 21 sections, 10 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The construction pipeline for the ICED dataset.
  • Figure 2: The training and inference pipeline of our method. Training flow is represented by the blue line. Inference flow is denoted by the orange line.
  • Figure 3: Training for fine-grained classification entails two components: masked language modeling (MLM), represented by the gray line, and context augmentation modeling (CAM), denoted by the blue line.
  • Figure 4: Prompt for generating euphemisms based on the seeds.
  • Figure 5: Prompt for creating samples, both with and without euphemisms.
  • ...and 3 more figures