A High-Capacity and Secure Disambiguation Algorithm for Neural Linguistic Steganography

Yapei Feng, Feng Jiang, Shanhao Wu, Hua Zhong

TL;DR

This work tackles tokenization-induced decoding failures in neural linguistic steganography by introducing Look-ahead Sync, a recursive disambiguation algorithm that preserves non-payload entropy to reclaim embedding capacity while maintaining computational zero-KL security. The method builds on a distribution-preserving disambiguation framework and improves the trade-off between security and capacity by performing minimal synchronized sampling only on truly indistinguishable sequences. Theoretical results yield a capacity upper bound under zero-KL security and quantify the residual gap, while empirical evaluations on English and Chinese benchmarks show substantial capacity gains (exceeding 160% in some settings) with zero KL divergence and high text quality. The approach advances practical high-capacity, provably secure linguistic steganography, enabling more efficient covert communication without compromising undetectability or fluency.

Abstract

Neural linguistic steganography aims to embed information into natural text while preserving statistical undetectability. A fundamental challenge in this field stems from tokenization ambiguity in modern tokenizers, which can lead to catastrophic decoding failures. The recent method, SyncPool, addresses this ambiguity by employing a coarse-grained synchronization mechanism over groups of ambiguous candidates. However, SyncPool sacrifices embedding capacity, as it utilizes the entire Shannon entropy of an ambiguous group solely for synchronization rather than for payload embedding. We propose a method named Look-ahead Sync, which overcomes the capacity limitation of SyncPool while retaining its provable security guarantees. Our approach performs minimal synchronized sampling only on truly indistinguishable token sequences, while strategically preserving all other discernible paths to maximize embedding capacity. We provide theoretical proofs for the security of our method and analyze the gap between its achievable embedding capacity and the theoretical upper bound. Experiments on English (using Llama 3) and Chinese (using Qwen 2.5) benchmarks show that our method consistently approaches the theoretical capacity upper bound and significantly outperforms SyncPool. The improvement in embedding rate exceeds 160% in English and 25% in Chinese, particularly in settings with larger candidate pools. This work represents a significant step toward practical high-capacity provably secure linguistic steganography.
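
The capacity argument in the abstract can be read through the chain rule of entropy. The following is only an illustrative sketch: the symbols $X$ (the next candidate), $G$ (the ambiguity group containing it), and $p_g$ are notation assumed here, not taken from the paper.

```latex
H(X) \;=\; H(G) \;+\; \sum_{g} p_g \, H(X \mid G = g),
\qquad p_g = \Pr[G = g].
```

Under the abstract's description, SyncPool spends the intra-group term $H(X \mid G = g)$ of an ambiguous group entirely on synchronization, so only the remaining entropy can carry payload. Look-ahead Sync aims to recover part of that intra-group term by synchronizing only on sequences that are truly indistinguishable.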

Paper Structure

This paper contains 27 sections, 24 equations, 7 figures, 2 tables, 4 algorithms.

Figures (7)

  • Figure 1: A toy example of tokenization ambiguity. The detokenizer $\varphi$ is not injective, so identical text (mistrust) can correspond to [278] or [377, 245]. If bits are embedded at the token level without resolving this ambiguity, the receiver may decode the wrong bits from the same visible string.
  • Figure 2: A schematic of the modular pipeline for a single generation step in an ambiguity-aware steganographic system, where the Disambiguation Module transforms the LLM's raw output into a structured set of choices for secure encoding.
  • Figure 3: The iterative architecture of Look-ahead Sync. The process begins with initialization and then enters a main loop that repeatedly executes three steps, namely partitioning candidates, embedding payload, and resolving ambiguity via a look-ahead mechanism to compute the next state. The loop continues until a terminal state is reached.
  • Figure 4: Inter-group encoding. Candidate token sequences are grouped by a common visible prefix and their probabilities are aggregated. An entropy encoder then maps a segment of the secret bitstream to a unique group according to the aggregated probabilities.
  • Figure 5: Look-ahead disambiguation. The selected intra-group candidates are partitioned into a prefix set and a partial set. A synchronized sampler selects a representative $s_{\text{sync}}$ from the prefix set, which is then expanded via an LLM call. The new candidate set for the next round is obtained by combining the preserved partial set with the new expansions, thereby reallocating probability mass to new continuations.
  • ...and 2 more figures
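
The ambiguity in Figure 1 and the grouping step in Figure 4 can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not the paper's algorithm: the token ids 278, 377, and 245 come from Figure 1, but the split of 377 and 245 into "mis" and "trust", the extra token 912, all probabilities, and the helper names `detokenize` and `group_by_visible_prefix` are hypothetical. Grouping here simply clusters candidates whose visible text starts with the same shortest candidate string.

```python
# Toy vocabulary (assumption): ids 278, 377, 245 follow Figure 1; 912 is made up.
VOCAB = {278: "mistrust", 377: "mis", 245: "trust", 912: "take"}


def detokenize(token_ids):
    """Toy detokenizer: concatenate token strings. It is not injective."""
    return "".join(VOCAB[t] for t in token_ids)


def group_by_visible_prefix(candidates):
    """Group candidate token sequences by a common visible prefix.

    `candidates` maps a tuple of token ids to its probability. Candidates are
    sorted by visible text, so a string always precedes its extensions; each
    group is keyed by the shortest visible text it contains, and member
    probabilities are summed.
    """
    groups = {}
    current_key = None
    ordered = sorted(candidates.items(), key=lambda kv: (detokenize(kv[0]), kv[0]))
    for seq, prob in ordered:
        text = detokenize(seq)
        if current_key is None or not text.startswith(current_key):
            # This text does not extend the current group key: open a new group.
            current_key = text
            groups[current_key] = {"prob": 0.0, "members": []}
        groups[current_key]["prob"] += prob
        groups[current_key]["members"].append(seq)
    return groups


# Hypothetical candidate pool with made-up probabilities.
candidates = {
    (278,): 0.30,      # "mistrust" as one token
    (377, 245): 0.10,  # "mis" + "trust", same visible string
    (377,): 0.20,      # "mis", a visible prefix of "mistrust"
    (912,): 0.40,      # "take", unambiguous
}
groups = group_by_visible_prefix(candidates)

for key, info in groups.items():
    print(key, round(info["prob"], 2), info["members"])
```

In this toy pool, the receiver can tell the "mis" group from the "take" group by looking at the text alone, so an entropy encoder could safely embed bits in the choice between groups. The three members of the "mis" group are what a disambiguation step such as the look-ahead mechanism in Figure 5 would still have to resolve.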