A High-Capacity and Secure Disambiguation Algorithm for Neural Linguistic Steganography

Yapei Feng; Feng Jiang; Shanhao Wu; Hua Zhong

TL;DR

This work tackles tokenization-induced decoding failures in neural linguistic steganography by introducing Look-ahead Sync, a recursive disambiguation algorithm that preserves non-payload entropy to reclaim embedding capacity while maintaining computational zero-KL security. The method builds on a distribution-preserving disambiguation framework and improves the trade-off between security and capacity by performing minimal synchronized sampling only on truly indistinguishable sequences. Theoretical results give a capacity upper bound under zero-KL security and quantify the residual gap, while empirical evaluations on English and Chinese benchmarks show substantial capacity gains (exceeding 160% in some settings) with zero KL divergence and high text quality. The approach advances practical, high-capacity, provably secure linguistic steganography, enabling more efficient covert communication without compromising undetectability or fluency.

Abstract

Neural linguistic steganography aims to embed information into natural text while preserving statistical undetectability. A fundamental challenge in this field stems from tokenization ambiguity in modern tokenizers, which can lead to catastrophic decoding failures. The recent method, SyncPool, addresses this ambiguity by employing a coarse-grained synchronization mechanism over groups of ambiguous candidates. However, SyncPool sacrifices embedding capacity, as it utilizes the entire Shannon entropy of an ambiguous group solely for synchronization rather than for payload embedding. We propose a method named Look-ahead Sync, which overcomes the capacity limitation of SyncPool while retaining its provable security guarantees. Our approach performs minimal synchronized sampling only on truly indistinguishable token sequences, while strategically preserving all other discernible paths to maximize embedding capacity. We provide theoretical proofs for the security of our method and analyze the gap between its achievable embedding capacity and the theoretical upper bound. Experiments on English (using Llama 3) and Chinese (using Qwen 2.5) benchmarks show that our method consistently approaches the theoretical capacity upper bound and significantly outperforms SyncPool. The improvement in embedding rate exceeds 160% in English and 25% in Chinese, particularly in settings with larger candidate pools. This work represents a significant step toward practical high-capacity provably secure linguistic steganography.
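
The capacity argument in the abstract can be illustrated with the chain rule of entropy. The sketch below is a toy illustration, not the authors' algorithm: it assumes that candidate tokens related by a string-prefix relation form one ambiguous group, and that a group-level scheme in the style of SyncPool can embed payload only in the choice of group. The token strings, probabilities, and function names (`entropy`, `prefix_groups`, `split_entropy`) are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def prefix_groups(candidates):
    """Pool candidate tokens that share a prefix relation (toy ambiguity test)."""
    groups = []
    # Visiting shorter tokens first means a token only ever joins the group
    # of one of its own prefixes.
    for tok in sorted(candidates, key=len):
        for group in groups:
            if any(tok.startswith(member) for member in group):
                group.append(tok)
                break
        else:
            groups.append([tok])
    return groups


def split_entropy(candidates):
    """Return (total, between_groups, within_groups) entropy in bits."""
    groups = prefix_groups(candidates)
    masses = [sum(candidates[t] for t in g) for g in groups]
    total = entropy(candidates.values())
    between = entropy(masses)
    within = sum(
        m * entropy([candidates[t] / m for t in g])
        for g, m in zip(groups, masses)
    )
    return total, between, within


# Hypothetical next-token distribution: "the", "then", "there" are mutually
# ambiguous under the toy prefix criterion, "cat" is not.
candidates = {"the": 0.4, "there": 0.2, "then": 0.1, "cat": 0.3}
total, between, within = split_entropy(candidates)
print(f"total={total:.3f} between={between:.3f} within={within:.3f} bits")
```

By the chain rule, the total entropy equals the between-group term plus the within-group term. Under the stated assumptions, the within-group term is the entropy a group-level scheme spends on synchronization; the abstract describes Look-ahead Sync as recovering part of it by synchronizing only on truly indistinguishable sequences.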
