Shifting-Merging: Secure, High-Capacity and Efficient Steganography via Large Language Models

Minhao Bai, Jinshuai Yang, Kaiyi Pang, Yongfeng Huang, Yue Gao

TL;DR

ShiMer addresses secure, high-capacity text steganography by leveraging the explicit next-token distributions of large language models. It encodes secret bits by pseudorandomly shifting and merging probability intervals, and decoding mirrors the process; a reordering step further reduces interval-splitting errors. The method achieves provable security, a high embedding rate and utilization, and favorable channel capacity across multiple models, while keeping text quality close to that of random sampling. This approach offers practical privacy protection in censorship-prone environments and can extend to other autoregressive domains, though it requires a pre-shared key and pseudorandom generator (PRG) and does not alter the model's entropy. The work demonstrates that interval-shifting encoding can outperform prior secure steganography techniques in both capacity and efficiency, validated through comprehensive experiments and analyses, including a formal security justification via $D_{KL}(P_S \| P_C) = 0$, that is, zero KL divergence between the stego and cover distributions.

Abstract

In the face of escalating surveillance and censorship in cyberspace, the sanctity of personal privacy has come under siege, necessitating the development of steganography, which offers a way to securely hide messages within innocent-looking texts. Previous methods alter the texts to hide private messages, which is not secure. Large Language Models (LLMs) provide high-quality, explicit distributions, which serve as a mathematical tool for secure steganography methods. However, existing attempts fail to achieve high capacity, time efficiency and correctness simultaneously, and their tightly coupled designs leave little room for refinement toward better performance. To provide a secure, high-capacity and efficient steganography method, we introduce ShiMer. Specifically, ShiMer pseudorandomly shifts the probability intervals of the LLM's distribution to obtain a private distribution, and samples a token according to the private bits. Steganographic texts produced by ShiMer are indistinguishable in quality from normal texts generated directly by the language model. To further enhance the capacity of ShiMer, we design a reordering algorithm that minimizes the occurrence of interval splitting during the decoding phase. Experimental results indicate that our method achieves the highest capacity and efficiency among existing secure steganography techniques.
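
The shifting idea in the abstract can be illustrated with a simplified, single-token sketch. This is not the authors' algorithm: the function names `shifted_sample` and `empirical_frequencies`, the toy three-token distribution, and the use of Python's `random.Random` as a stand-in for the shared keyed PRG are all assumptions made for illustration. The sketch treats the message bits as a point in $[0, 1)$, adds a pseudorandom shift modulo 1, and picks the token whose probability interval contains the result. Because the shifted point is uniform whenever the shift is uniform, the token frequencies should track the model's distribution even when the message point is held fixed.

```python
import bisect
import itertools
import random


def shifted_sample(probs, message_point, shift):
    """Return the index of the token whose interval contains
    (message_point + shift) mod 1. Token i owns the half-open
    interval between consecutive cumulative sums of `probs`."""
    u = (message_point + shift) % 1.0
    right_edges = list(itertools.accumulate(probs))
    idx = bisect.bisect_right(right_edges, u)
    # Guard against floating-point round-off at the top edge.
    return min(idx, len(probs) - 1)


def empirical_frequencies(probs, message_point, trials, seed=0):
    """Sample `trials` tokens with a fixed message point and fresh shifts.
    random.Random is only a stand-in for a shared keyed PRG and is
    NOT cryptographically secure."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    for _ in range(trials):
        counts[shifted_sample(probs, message_point, rng.random())] += 1
    return [c / trials for c in counts]


if __name__ == "__main__":
    toy_probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
    print(empirical_frequencies(toy_probs, message_point=0.1, trials=20000))
```

In this toy setting the printed frequencies stay close to `[0.5, 0.3, 0.2]` regardless of the chosen message point, which is the intuition behind text that matches ordinary sampling from the model. A real system would need a cryptographically secure PRG and exact interval arithmetic.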

Paper Structure

This paper contains 22 sections, 12 equations, 5 figures, 3 tables, 4 algorithms.

Figures (5)

  • Figure 1: An overview of ShiMer. The sender and receiver share the same key, history and pseudorandom generator (PRG). The sampling procedure is controlled by the message bits and the number generated by the PRG. During the embedding process, the probability interval of each sampled symbol is shifted backward by the number generated by the PRG and then merged into a single interval. The shared prefix of this interval then gives the encoded bits.
  • Figure 2: Interval splitting. The encoder can find two separate intervals $[\bar{B}_l, P_h - r)$ and $[\bar{B}_h + P_l - r, \bar{B}_h)$.
  • Figure 3: Possible permutations in the worst cases. The purple gap between the large yellow intervals can be shifted to different regions, where our reordering algorithm can reduce the probability of error.
  • Figure 4: Metrics of METEOR, DISCOP and ShiMer when sampling from the top 100 to top 400 symbols.
  • Figure 5: A simple example of encoding.
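
The captions of Figures 1 and 2 describe shifting a sampled symbol's interval backward by the PRG output and reading off a shared bit prefix, with a possible split into two intervals. The following is a minimal single-token sketch of those two steps only, under stated assumptions: the helper names `shift_back` and `shared_prefix` are hypothetical, exact `Fraction` arithmetic replaces whatever precision the paper uses, and the merging across several tokens and the reordering step are not modeled.

```python
from fractions import Fraction


def shift_back(lo, hi, shift):
    """Shift the token interval [lo, hi) backward by `shift` modulo 1.
    Returns one interval, or two when the result wraps past 0
    (the split case)."""
    a = (lo - shift) % 1
    b = (hi - shift) % 1
    if b == 0:
        b = Fraction(1)  # a right edge landing on 0 is really the top edge 1
    if a < b:
        return [(a, b)]
    return [(Fraction(0), b), (a, Fraction(1))]


def shared_prefix(a, b, max_bits=32):
    """Bits shared by the binary expansions of every point in [a, b)."""
    bits = []
    half = Fraction(1, 2)
    while len(bits) < max_bits:
        if b <= half:        # interval lies in the lower half: next bit is 0
            bits.append("0")
            a, b = 2 * a, 2 * b
        elif a >= half:      # interval lies in the upper half: next bit is 1
            bits.append("1")
            a, b = 2 * a - 1, 2 * b - 1
        else:                # interval straddles 1/2: no further shared bit
            break
    return "".join(bits)


if __name__ == "__main__":
    lo, hi = Fraction(1, 2), Fraction(4, 5)  # hypothetical token interval
    for r in (Fraction(1, 2), Fraction(7, 10)):
        pieces = shift_back(lo, hi, r)
        print(r, [(p, shared_prefix(*p)) for p in pieces])
```

With shift `1/2` the token interval maps to the single interval `[0, 3/10)` with prefix `0`. With shift `7/10` it wraps and splits into `[0, 1/10)` and `[4/5, 1)`, whose prefixes `000` and `11` disagree. In this toy the split leaves the decoder with two candidates, which is the kind of ambiguity the reordering algorithm is said to make less frequent.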