A Split-Window Transformer for Multi-Model Sequence Spammer Detection using Multi-Model Variational Autoencoder

Zhou Yang, Yucai Pang, Hongbo Yin, Yunpeng Xiao

TL;DR

The paper tackles multi-modal spammer detection over ultra-long sequences of user history. It introduces MS$^2$Dformer, a Transformer backbone that combines MVAE-based two-channel tokenization of multi-modal user history with a hierarchical split-window attention mechanism (SW/W-MHA) to model ultra-long sequences efficiently. The model has four stages: MVAE-based tokenization; intra-window SW-MHA and inter-window W-MHA for short- and long-term dependencies; deeper sequence feature mining; and a classifier head. It is trained with a total loss that combines the MVAE reconstruction losses with cross-entropy. Empirical results on Weibo datasets show state-of-the-art accuracy and efficiency, supporting its use as a backbone for real-world multi-modal sequence spammer detection and suggesting potential for other ultra-long sequence tasks.
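
As a rough illustration of how a fused objective of this kind is commonly written, one possible form is shown here. The exact formulation is not given in this summary, so the weighting coefficient $\lambda$, the modality index $m$, and the choice of text and image as the two modalities are assumptions made for illustration only:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda \sum_{m \in \{\text{text},\, \text{image}\}} \mathcal{L}_{\text{MVAE}}^{(m)}$$

Here $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss of the classifier head and $\mathcal{L}_{\text{MVAE}}^{(m)}$ stands for the MVAE loss of modality $m$. The paper's own definition may weight or group these terms differently.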

Abstract

This paper introduces a new Transformer, called MS$^2$Dformer, that can be used as a generalized backbone for multi-modal sequence spammer detection. Spammer detection is a complex multi-modal task, so applying a Transformer poses two challenges. First, complex and noisy multi-modal information about users can interfere with feature mining. Second, the long sequences of users' historical behaviors put heavy GPU memory pressure on the attention computation. To solve these problems, we first design a user behavior tokenization algorithm based on the multi-modal variational autoencoder (MVAE). Subsequently, a hierarchical split-window multi-head attention (SW/W-MHA) mechanism is proposed. The split-window strategy transforms ultra-long sequences hierarchically into a combination of intra-window short-term attention and inter-window overall attention. Pre-trained on the public datasets, MS$^2$Dformer far exceeds the previous state of the art. The experiments demonstrate MS$^2$Dformer's ability to act as a backbone.
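
To make the memory argument concrete, the following minimal sketch counts the attention-score entries ($QK^\text{T}$ cells) per head for full attention versus a two-level split-window scheme. It is a back-of-the-envelope model, not the paper's implementation: it assumes non-overlapping windows of length `w`, a padded last window, and one summary token per window for the inter-window stage. The function names are hypothetical.

```python
from math import ceil


def full_attention_scores(n: int) -> int:
    """Score entries for ordinary attention over n tokens (n x n)."""
    return n * n


def split_window_scores(n: int, w: int) -> int:
    """Score entries for a two-level scheme (illustrative assumption):
    intra-window attention inside each window of length w, plus
    inter-window attention over one summary token per window."""
    num_windows = ceil(n / w)          # last window is padded to length w
    intra = num_windows * w * w        # one w x w score matrix per window
    inter = num_windows * num_windows  # window-level score matrix
    return intra + inter


n, w = 4096, 64
print(full_attention_scores(n))   # 16777216
print(split_window_scores(n, w))  # 266240
```

Under these assumptions, a 4096-token history with 64-token windows needs roughly 63 times fewer score entries than full attention, which is the kind of saving the hierarchical design targets.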

Paper Structure

This paper contains 22 sections, 24 equations, 9 figures, 7 tables, 2 algorithms.

Figures (9)

  • Figure 1: Examples of complex cross-modal feature mining challenges. (a) and (b) are standard cases: the core argument can be identified from the text modality, and the other modalities provide auxiliary features. (c) The text modality does not directly state the point of view, so the core argument must be aligned and mined jointly with the image. (d) This situation is the most common in social behavior: text is often accompanied by noise features such as emojis and uncommon characters (@, #, //, etc.). In special cases, it also contains URL links that elaborate the argument.
  • Figure 2: Inspiration for hierarchical attention mechanisms. (a) is the classical multi-head attention mechanism (MHA). (b-h) are sparse multi-head attention (SMHA) variants for ultra-long sequence modeling. (b) and (c) are SMHAs based on split windows. Their shared core idea is to limit the receptive field of each element, thereby reducing the cost of computing $QK^\text{T}$. Among them, (c) expands the windowed receptive field in a manner similar to dilated convolution. Because the CPU dominates the windowing process, the full SMHA computation is slow (see Table \ref{table-ab-memory}). To solve this problem, researchers set the sliding distance equal to the window length, yielding the block split-window mechanism (see (d) and (h)). Meanwhile, considering the importance of CLS tokens, Longformer (Beltagy et al., 2020) proposes global attention based on (d). Subsequently, random sampling is also added (see (f) and (g)).
  • Figure 3: The framework of the MS$^2$Dformer$_\text{B}$ model (MS$^2$Dformer Base Version, see Eq. (\ref{eq-base})).
  • Figure 4: The case of SW/W-MHA.
  • Figure 5: Statistics from two publicly available datasets.
  • ...and 4 more figures
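
The attention patterns described in the Figure 2 caption can be sketched as boolean masks, where `True` marks a query-key pair whose score is computed. This is an illustrative sketch only: the helper names, the `radius` parameter, and the choice of token 0 as the global (CLS-like) token are assumptions, and the mapping to the figure's panels is loose.

```python
def full_mask(n):
    """Classical MHA: every token attends to every token."""
    return [[True] * n for _ in range(n)]


def sliding_window_mask(n, radius):
    """Sliding window: token i attends to tokens within `radius` of i."""
    return [[abs(i - j) <= radius for j in range(n)] for i in range(n)]


def block_window_mask(n, w):
    """Block split-window: stride equals window length w, so tokens
    attend only within their own non-overlapping block."""
    return [[i // w == j // w for j in range(n)] for i in range(n)]


def with_global_token(mask, g=0):
    """Add global attention for token g: it attends to all tokens and
    all tokens attend to it. Returns a new mask."""
    n = len(mask)
    out = [row[:] for row in mask]
    for k in range(n):
        out[g][k] = True
        out[k][g] = True
    return out


def count_scores(mask):
    """Number of query-key pairs that are actually computed."""
    return sum(sum(row) for row in mask)


n = 8
print(count_scores(full_mask(n)))                                # 64
print(count_scores(sliding_window_mask(n, 1)))                   # 22
print(count_scores(block_window_mask(n, 4)))                     # 32
print(count_scores(with_global_token(block_window_mask(n, 4))))  # 40
```

Because block windows do not overlap, the block pattern can be computed by reshaping the sequence into windows rather than gathering a separate neighborhood per token, which is consistent with the caption's point about the cost of the windowing step.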