Efficient Streaming Voice Steganalysis in Challenging Detection Scenarios

Pengcheng Zhou, Zhengyang Fang, Zhongliang Yang, Zhili Zhou, Linna Zhou

TL;DR

This paper introduces the Dual-View VoIP Steganalysis Framework (DVSF), which randomly obfuscates parts of the native steganographic descriptors in VoIP stream segments so that the steganographic features of hard-to-detect samples become more pronounced and easier to learn.

Abstract

In recent years, a growing number of information hiding techniques have targeted network streaming media, focusing on how to covertly and efficiently embed secret information into media signals transmitted in real time to achieve concealed communication. Misuse of these techniques can lead to significant security risks, such as the spread of malicious code, commands, and viruses. Current steganalysis methods for network voice streams face two major challenges: efficient detection at low embedding rates and efficient detection on short-duration samples. These challenges arise because, at low embedding rates (e.g., as low as 10%) and short transmission durations (e.g., only 0.1 second), detection models struggle to acquire sufficiently rich sample features, making effective steganalysis difficult. To address these challenges, this paper introduces a Dual-View VoIP Steganalysis Framework (DVSF). The framework first randomly obfuscates parts of the native steganographic descriptors in VoIP stream segments, making the steganographic features of hard-to-detect samples more pronounced and easier to learn. It then captures fine-grained local features related to steganography, building on the global features of VoIP. Specially constructed VoIP segment triplets further adjust the feature distances within the model. Together, these steps address the difficulty of detection in VoIP. Extensive experiments demonstrate that our method significantly improves the accuracy of streaming voice steganalysis in these challenging detection scenarios, surpassing existing state-of-the-art methods and offering superior near-real-time performance.
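The random obfuscation step described in the abstract can be pictured with a small sketch. Everything in it is an illustrative assumption rather than the paper's implementation: a segment is modeled as a list of frames, each frame as a list of integer codewords, and the names `obfuscate_segment`, `ratio`, and `mask_value` are hypothetical. The actual module in DVSF may select and replace descriptors differently.

```python
import random


def obfuscate_segment(frames, ratio=0.3, mask_value=-1, seed=None):
    """Return a copy of `frames` with a random subset of frames masked.

    Assumptions (not from the paper): a frame is a list of integer
    codewords, and "obfuscation" means overwriting a whole frame with
    a placeholder value.
    """
    rng = random.Random(seed)
    n_mask = int(round(ratio * len(frames)))
    # Pick which frame positions to hide, without replacement.
    masked_idx = set(rng.sample(range(len(frames)), n_mask))
    return [
        [mask_value] * len(frame) if i in masked_idx else list(frame)
        for i, frame in enumerate(frames)
    ]


# Toy segment: 10 frames with 3 codewords each (values are arbitrary).
segment = [[i, i + 1, i + 2] for i in range(10)]
view = obfuscate_segment(segment, ratio=0.3, seed=0)
```

The function returns a new list and leaves `segment` untouched, so the original and the obfuscated copy can be fed to a model side by side as two views of the same segment.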

Paper Structure

This paper contains 21 sections, 11 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: In the VoIP steganalysis scenario, real-time voice streams transmitted over the network must be intercepted and efficiently analyzed to determine whether they contain concealed information.
  • Figure 2: An overview of the proposed framework DVSF, which consists of three integral components: the CutMix module, the Hybrid Attention Model (HAM), and the Joint Training Strategy (JTS), encompassing both the training and prediction processes. The VoIP segment triplet features are solely the outputs of the model during the training phase and are not required in the prediction phase.
  • Figure 3: The Joint Training Strategy promotes the alignment and uniformity of feature distribution in the model feature space by influencing the feature distances within positive and negative VoIP segment pairs, thereby making the feature space linearly separable.
  • Figure 4: For VoIP segment datasets $D_e$ and $D_s$, the detection accuracy varies with the embedding rate and segment length.
  • Figure 5: The distribution of speech features extracted by the proposed method and the best existing methods in statistical space varies with the embedding rate. Each dot represents a speech sample with a length of 1s; the orange points indicate cover speech, and the blue points indicate stego speech carrying concealed information at different embedding rates.
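The way triplets adjust feature distances, as described in the abstract and in the Figure 3 caption, can be illustrated with the standard triplet margin loss. This is a generic sketch, not the paper's Joint Training Strategy: the exact loss is not given in this section, and the function names, the Euclidean distance, and the `margin` value are assumptions chosen for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`; otherwise it grows with the gap.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D features (arbitrary values, not taken from the paper).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # same class as the anchor, distance 1
negative = [3.0, 4.0]   # other class, distance 5

loss_separated = triplet_margin_loss(anchor, positive, negative)  # 1 - 5 + 1 < 0
loss_violating = triplet_margin_loss(anchor, negative, positive)  # 5 - 1 + 1
```

Minimizing a loss of this shape pulls same-class segment features together and pushes cover and stego features apart, which is the effect on positive and negative pairs that the Figure 3 caption describes.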