Pseudo-label Based Domain Adaptation for Zero-Shot Text Steganalysis

Yufei Luo, Zhen Yang, Ru Zhang, Jianyi Liu

TL;DR

Zero-shot cross-domain text steganalysis faces challenges from limited labeled data and domain shift across corpora. The authors propose PDTS, a two-stage framework that combines a BERT-based domain-agnostic feature extractor with a single-layer Bi-LSTM domain-specific extractor, a feature filtration network, and a classifier. Training uses labeled source-domain data for pre-training and unlabeled target-domain data with progressively expanded pseudo-labels (expansion parameter $p$, default $p=0.1$) for fine-tuning. Empirical results on Twitter, Movie, and News datasets show PDTS achieves higher detection accuracy and F1 than MDA and SANet, particularly at higher embedding rates, indicating strong zero-shot transfer and robustness. This work reduces reliance on labeled target-domain data and offers a practical approach for real-world text steganalysis under domain shift.

Abstract

Currently, most methods for text steganalysis are based on deep neural networks (DNNs). However, in real-life scenarios, obtaining enough labeled stego-text to correctly train networks with a large number of parameters is often challenging and costly. Additionally, due to a phenomenon known as dataset bias or domain shift, recognition models trained on a large dataset generalize poorly to novel datasets and tasks. Therefore, to address the issues of missing labeled data and inadequate model generalization in text steganalysis, this paper proposes a cross-domain stego-text analysis method (PDTS) based on pseudo-labeling and unsupervised domain adaptation. Specifically, we propose a model architecture combining pre-trained BERT with a single-layer Bi-LSTM to learn and extract generic features across tasks and generate task-specific representations. Considering that different features contribute unequally to steganalysis, we further design a feature filtering mechanism to achieve selective feature propagation, thereby enhancing classification performance. We train the model on labeled source-domain data and adapt it to the target-domain data distribution through self-training on pseudo-labels assigned to unlabeled target-domain data. In the label estimation step, instead of a static sampling strategy, we propose a progressive sampling strategy that gradually increases the number of selected pseudo-label candidates. Experimental results demonstrate that our method performs well in zero-shot text steganalysis, achieving high detection accuracy even without labeled data in the target domain, and outperforms current zero-shot text steganalysis methods.
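
The architecture described above has four pieces: a pre-trained BERT encoder for domain-agnostic features, a single-layer Bi-LSTM for task-specific representations, a feature filtration network, and a binary classifier. A minimal PyTorch sketch follows; the hidden size, the mean pooling, and the sigmoid-gate form of the filter are assumptions made for illustration, since this summary does not give the paper's exact dimensions or filtration design, and `PDTSModel` is an illustrative name.

```python
# Minimal sketch of a PDTS-style model: BERT (generic features) ->
# single-layer Bi-LSTM (task-specific features) -> feature filtration
# gate -> binary classifier (cover vs. stego). Dimensions and the gate
# design are assumptions, not taken from the paper.
import torch
import torch.nn as nn
from transformers import BertModel

class PDTSModel(nn.Module):
    def __init__(self, bert_name="bert-base-uncased", lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)  # domain-agnostic extractor
        self.bilstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=lstm_hidden,
            num_layers=1,
            batch_first=True,
            bidirectional=True,
        )                                                 # domain-specific extractor
        feat_dim = 2 * lstm_hidden
        self.filter_gate = nn.Sequential(                 # feature filtration network
            nn.Linear(feat_dim, feat_dim),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(feat_dim, 2)          # cover vs. stego

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        h, _ = self.bilstm(h)
        h = h.mean(dim=1)                # pool token states into one vector
        h = h * self.filter_gate(h)      # selective feature propagation
        return self.classifier(h)
```

A sigmoid gate keeps the filtration step differentiable, so whatever "selective feature propagation" the paper implements can plausibly be learned end-to-end alongside the classifier.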

Paper Structure

This paper contains 17 sections, 11 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: The overall structure of PDTS for cross-domain text steganalysis consists of two main stages. The first stage, represented by the blue line, involves pre-training using labeled source domain data to initialize model parameters. Following this, the second stage, indicated by the green line, is the fine-tuning process. Here, the pre-trained model generates pseudo-labels for unlabeled target domain data. These refined pseudo-labeled data are then used to further fine-tune the model, enhancing its adaptability and accuracy on target domain data.
  • Figure 2: Illustration of the two-stage process of pre-training and fine-tuning. The 'epoch' in the figure refers to the number of times pseudo-labels are selected and used to fine-tune the model. $m_t$ refers to the number of pseudo-labels selected in each round of fine-tuning, which grows with each training round (see the sketch after this list).
  • Figure 3: Visualization of the features extracted by PDTS on the six cross-domain tasks. Blue dots represent cover text; red dots represent steganographic text.
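
Figure 2's caption describes the core of the fine-tuning stage: in each round, the model pseudo-labels the unlabeled target set and keeps the $m_t$ most confident predictions, with $m_t$ growing round by round. The sketch below illustrates this under stated assumptions: the max-softmax confidence criterion, the linear expansion schedule, and the names `select_pseudo_labels`, `progressive_schedule`, and `target_batches` are illustrative guesses, not taken from the paper.

```python
# Hedged sketch of progressive pseudo-label selection: score every
# unlabeled target example with the current model and keep the m_t most
# confident predictions, where m_t expands each round by a fraction p of
# the target-set size (p = 0.1 is the paper's default per the TL;DR).
import torch
import torch.nn.functional as F

def select_pseudo_labels(model, target_batches, m_t):
    """Label the unlabeled target set; keep the m_t most confident examples."""
    scored = []
    model.eval()
    with torch.no_grad():
        for ids, input_ids, attention_mask in target_batches:
            probs = F.softmax(model(input_ids, attention_mask), dim=-1)
            conf, pseudo = probs.max(dim=-1)   # max-softmax confidence (assumed)
            scored.extend(zip(conf.tolist(), ids.tolist(), pseudo.tolist()))
    scored.sort(reverse=True)                  # most confident first
    return [(i, y) for _, i, y in scored[:m_t]]

def progressive_schedule(n_target, p=0.1, rounds=10):
    """Assumed linear schedule: round t selects m_t = min(t * p * N, N)."""
    return [min(round(t * p * n_target), n_target) for t in range(1, rounds + 1)]
```

Under this assumed schedule, with $p = 0.1$ and ten rounds the model fine-tunes on the most confident 10% of the target set in round 1, 20% in round 2, and the full set by round 10, which matches the "progressively expanded pseudo-labels" described in the TL;DR.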

Theorems & Definitions (1)

  • Definition 1