Keyword-Oriented Multimodal Modeling for Euphemism Identification

Yuxue Hu, Junsong Li, Meixuan Chen, Dongyu Su, Tongguan Wang, Ying Sha

TL;DR

This paper tackles euphemism identification by moving beyond text-only approaches to a keyword-oriented multimodal framework. It introduces KOM-Euph, the first large-scale text-image-speech euphemism corpus across Drug, Weapon, and Sexuality domains, and KOM-EI, a model that aligns and fuses textual, visual, and audio signals through cross-modal contrastive learning, cross-attention, and gating. The method achieves state-of-the-art performance and superior efficiency compared with large language models, demonstrating the value of multimodal data for disambiguating euphemisms in dangerous or illicit content. This work advances content moderation capabilities and provides a foundation for more robust analyses of evolving euphemisms in multimedia contexts.

Abstract

Euphemism identification deciphers the true meaning of euphemisms, such as linking "weed" (euphemism) to "marijuana" (target keyword) in illicit texts, aiding content moderation and combating underground markets. While existing methods are primarily text-based, the rise of social media highlights the need for multimodal analysis, incorporating text, images, and audio. However, the lack of multimodal datasets for euphemisms limits further research. To address this, we regard euphemisms and their corresponding target keywords as keywords and first introduce a keyword-oriented multimodal corpus of euphemisms (KOM-Euph), involving three datasets (Drug, Weapon, and Sexuality), including text, images, and speech. We further propose a keyword-oriented multimodal euphemism identification method (KOM-EI), which uses cross-modal feature alignment and dynamic fusion modules to explicitly utilize the visual and audio features of the keywords for efficient euphemism identification. Extensive experiments demonstrate that KOM-EI outperforms state-of-the-art models and large language models, and show the importance of our multimodal datasets.
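The abstract names cross-modal feature alignment and dynamic fusion without giving formulas, so the following is only a minimal sketch of what such components commonly look like, not the authors' implementation. It assumes single-head scaled dot-product attention for the cross-modal step, a scalar sigmoid gate for fusion, and an InfoNCE-style loss for alignment. All function names, vector sizes, and toy values are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Text query attends over image/speech vectors (scaled dot-product)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def gated_fusion(text_vec, context_vec, gate_weights, gate_bias=0.0):
    """Assumed gate: g = sigmoid(w . [text; context] + b), then a convex mix."""
    z = dot(gate_weights, text_vec + context_vec) + gate_bias
    g = 1.0 / (1.0 + math.exp(-z))
    fused = [g * t + (1.0 - g) * c for t, c in zip(text_vec, context_vec)]
    return fused, g

def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """InfoNCE-style contrastive loss over cosine similarities."""
    def cos(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    probs = softmax([cos(anchor, c) / temperature for c in candidates])
    return -math.log(probs[positive_index])

# Toy 4-dimensional features (made-up numbers, for illustration only).
text_vec = [1.0, 0.0, 0.5, 0.0]
image_feats = [[0.9, 0.1, 0.4, 0.0], [0.0, 1.0, 0.0, 0.3]]
audio_feats = [[0.8, 0.0, 0.6, 0.1]]

# Fuse the text feature with what it attends to in the other modalities.
other = image_feats + audio_feats
context_vec = cross_attention(text_vec, other, other)
fused_vec, gate = gated_fusion(text_vec, context_vec, [0.1] * 8)

# The loss is low when the matching image is treated as the positive.
loss_matched = info_nce(text_vec, image_feats, positive_index=0)
loss_mismatched = info_nce(text_vec, image_feats, positive_index=1)
print(fused_vec, gate, loss_matched, loss_mismatched)
```

In this sketch the gate decides how much of the text feature survives relative to the attended visual and audio context. In a trained model the gate weights would be learned parameters rather than the fixed constants used here.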

Paper Structure

This paper contains 32 sections, 18 equations, 12 figures, and 9 tables.

Figures (12)

  • Figure 1: Image and speech examples of keywords.
  • Figure 2: The left part illustrates the self-supervised learning scheme for constructing labeled training sets, where sentences with masked target keywords are labeled and enriched with multimodal information. The right part shows the architecture of our KOM-EI, which consists of three modules: (1) Feature Representation extracts text, image, and speech features using pre-trained models; (2) Feature Fusion dynamically aligns and integrates multimodal features via co-attention mechanisms; and (3) Prediction identifies target keywords based on fused features.
  • Figure 3: Samples of multimodal datasets.
  • Figure 4: Representation distribution of multimodal data and target keywords before and after fusing.
  • Figure 5: Instruction given to GPT-4o for euphemism recognition tasks.
  • ...and 7 more figures