Steganography Beyond Space-Time with Chain of Multimodal AI

Ching-Chun Chang, Isao Echizen

TL;DR

This work addresses the vulnerability of traditional spatial and temporal steganography to generative AI manipulation by embedding messages in the linguistic domain, operating on the text transcribed from audiovisual content. It proposes a chain of multimodal AI that demultiplexes the cover media, transcribes the audio to text, embeds a message by biasing word sampling in a paraphrase generator under a shared key, and then reconstructs the stego audiovisuals through voice cloning and lip-syncing. The approach supports zero-bit and multi-bit capacity settings. Fidelity is evaluated through biometric and semantic similarity, secrecy through statistical analyses, and robustness against resampling, face-swapping, and voice-cloning. Key contributions include an end-to-end pipeline built from open-source components, quantitative demonstrations of high biometric and semantic fidelity, and robustness against both common and adversarial perturbations, highlighting the potential and the risks of invariant-domain steganography in multimodal content.

Abstract

Steganography is the art and science of covert writing, with a broad range of applications interwoven within the realm of cybersecurity. As artificial intelligence continues to evolve, its ability to synthesise realistic content emerges as a threat in the hands of cybercriminals who seek to manipulate and misrepresent the truth. Such synthetic content introduces a non-trivial risk of overwriting the subtle changes made for the purpose of steganography. When the signals in both the spatial and temporal domains are vulnerable to unforeseen overwriting, it calls for reflection on what, if any, remains invariant. This study proposes a paradigm in steganography for audiovisual media, where messages are concealed beyond both spatial and temporal domains. A chain of multimodal artificial intelligence is developed to deconstruct audiovisual content into a cover text, embed a message within the linguistic domain, and then reconstruct the audiovisual content through synchronising both auditory and visual modalities with the resultant stego text. The message is encoded by biasing the word sampling process of a language generation model and decoded by analysing the probability distribution of word choices. The accuracy of message transmission is evaluated under both zero-bit and multi-bit capacity settings. Fidelity is assessed through both biometric and semantic similarities, capturing the identities of the recorded face and voice, as well as the core ideas conveyed through the media. Secrecy is examined through statistical comparisons between cover and stego texts. Robustness is tested across various scenarios, including audiovisual resampling, face-swapping, voice-cloning and their combinations.
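
To make the encoding and decoding idea concrete, the following is a minimal sketch of keyed, biased word sampling in the zero-bit setting. It is not the authors' implementation: the abstract does not specify the biasing rule, so the keyed "favoured word" split, the toy vocabulary `VOCAB`, the uniform stand-in for the language model, the `bias` parameter, and the z-score detector are all assumptions chosen for illustration.

```python
import hashlib
import math
import random

# Hypothetical toy vocabulary standing in for a language model's word list.
VOCAB = [f"w{i}" for i in range(200)]


def favoured(key: str, prev: str, word: str) -> bool:
    """Keyed pseudo-random split: roughly half the vocabulary is favoured
    in each context (assumed rule, not taken from the paper)."""
    digest = hashlib.sha256(f"{key}|{prev}|{word}".encode()).digest()
    return digest[0] % 2 == 0


def generate(key, n_words: int, bias: float, seed: int) -> list:
    """Sample words one at a time; if a key is given, up-weight favoured words."""
    rng = random.Random(seed)
    words, prev = [], "<s>"
    for _ in range(n_words):
        # Stand-in for model probabilities: uniform base weight of 1.0 per word,
        # multiplied by exp(bias) for words favoured under the shared key.
        weights = [
            math.exp(bias) if key is not None and favoured(key, prev, w) else 1.0
            for w in VOCAB
        ]
        prev = rng.choices(VOCAB, weights=weights, k=1)[0]
        words.append(prev)
    return words


def z_score(key: str, words: list) -> float:
    """Count favoured words and compare with the 50% expected by chance."""
    prev, hits = "<s>", 0
    for w in words:
        hits += favoured(key, prev, w)
        prev = w
    n = len(words)
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)


stego = generate("shared-key", n_words=200, bias=2.0, seed=0)
cover = generate(None, n_words=200, bias=0.0, seed=1)
print("stego z-score:", z_score("shared-key", stego))
print("cover z-score:", z_score("shared-key", cover))
```

Under these assumptions, the decoder needs only the shared key and the recovered text, which is why the hidden signal could in principle survive changes to the pixels and waveform so long as the transcribed words are preserved. A multi-bit variant would need a rule that maps message bits to word choices, which this sketch does not attempt.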

Paper Structure

This paper contains 10 sections, 15 equations, 9 figures.

Figures (9)

  • Figure 1: Overview of the message encoding process with a shared key, converting a cover multimedia container into a stego multimedia container.
  • Figure 2: Overview of the message decoding process with a shared key, making a binary decision on a query multimedia container.
  • Figure 3: Evaluation of accuracy in zero-bit capacity setting.
  • Figure 4: Evaluation of accuracy in multi-bit capacity setting.
  • Figure 5: Evaluation of fidelity in terms of biometric and semantic similarities.
  • ...and 4 more figures