
MerkleSpeech: Public-Key Verifiable, Chunk-Localised Speech Provenance via Perceptual Fingerprints and Merkle Commitments

Tatsunori Ono

TL;DR

MerkleSpeech addresses the need for verifiable, chunk-level speech provenance that remains useful after distribution transforms. It combines deterministic chunk fingerprints, Merkle commitments, a signed root, and an in-band watermark to enable public verification of per-chunk provenance, with two tiers: WM-only attribution and MSv1 cryptographic integrity. On LibriSpeech, it achieves a 99.9% payload decode rate on clean audio and robust WM-only recovery under moderate transforms, while MSv1 provides strict integrity detection that is more brittle under degradation. This work enables third-party, splice-aware provenance workflows compatible with existing signed-manifest ecosystems, and it lays out a path toward transform-tolerant fingerprints for improved robustness without sacrificing cryptographic guarantees.

Abstract

Speech provenance goes beyond detecting whether a watermark is present. Real workflows involve splicing, quoting, trimming, and platform-level transforms that may preserve some regions while altering others. Neural watermarking systems have made strides in robustness and localised detection, but most deployments produce outputs with no third-party verifiable cryptographic proof tying a time segment to an issuer-signed original. Provenance standards like C2PA adopt signed manifests and Merkle-based fragment validation, yet their bindings target encoded assets and break under re-encoding or routine processing. We propose MerkleSpeech, a system for public-key verifiable, chunk-localised speech provenance offering two tiers of assurance. The first, a robust watermark attribution layer (WM-only), survives common distribution transforms and answers "was this chunk issued by a known party?". The second, a strict cryptographic integrity layer (MSv1), verifies Merkle inclusion of the chunk's fingerprint under an issuer signature. The system computes perceptual fingerprints over short speech chunks, commits them in a Merkle tree whose root is signed with an issuer key, and embeds a compact in-band watermark payload carrying a random content identifier and chunk metadata sufficient to retrieve Merkle inclusion proofs from a repository. Once the payload is extracted, all subsequent verification steps (signature check, fingerprint recomputation, Merkle inclusion) use only public information. The result is a splice-aware timeline indicating which regions pass each tier and why any given region fails. We describe the protocol, provide pseudocode, and present experiments targeting very low false positive rates under resampling, bandpass filtering, and additive noise, informed by recent audits identifying neural codecs as a major stressor for post-hoc audio watermarks.
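
The commitment and inclusion check described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the paper's actual construction: fingerprints are treated as opaque byte strings, the hash is SHA-256, the leaf/node prefixes follow the domain-separation style of RFC 6962, odd levels are padded by duplicating the last node (a common simplification), and the function names `build_levels`, `inclusion_proof`, and `verify_inclusion` are hypothetical. Signing the root with the issuer key is omitted.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(fingerprint: bytes) -> bytes:
    # Prefix 0x00 separates leaf hashes from interior-node hashes.
    return _h(b"\x00" + fingerprint)


def node_hash(left: bytes, right: bytes) -> bytes:
    # Prefix 0x01 marks an interior node.
    return _h(b"\x01" + left + right)


def build_levels(fingerprints):
    """Return all tree levels, leaves first; the root is levels[-1][0]."""
    if not fingerprints:
        raise ValueError("need at least one chunk fingerprint")
    level = [leaf_hash(fp) for fp in fingerprints]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed padding rule: duplicate the last node on odd levels.
            level = level + [level[-1]]
            levels[-1] = level
        level = [node_hash(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def inclusion_proof(levels, index):
    """Sibling path for chunk `index` as (sibling_hash, sibling_is_left) pairs."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_inclusion(fingerprint, proof, root):
    """Recompute the root from a recomputed fingerprint and its sibling path."""
    acc = leaf_hash(fingerprint)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root


# Usage with placeholder fingerprints (not real MFCC-derived values).
fingerprints = [bytes([i]) * 8 for i in range(5)]
levels = build_levels(fingerprints)
root = levels[-1][0]  # this is the value an issuer would sign
proof = inclusion_proof(levels, 3)
print(verify_inclusion(fingerprints[3], proof, root))
```

In this sketch the verifier needs only the recomputed fingerprint, the sibling path, and the signed root, which mirrors the claim that verification after payload extraction uses public information only. A fingerprint that changes by even one bit fails the check, which is consistent with the strict tier being brittle under degradation.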

Paper Structure (53 sections, 5 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: MerkleSpeech overview. Enrollment: chunk audio, compute chunk fingerprints, build Merkle tree, sign root, publish manifest and inclusion proofs. Distribution: embed compact in-band payload per chunk. Verification: decode payload, fetch manifest and proofs, verify signature and Merkle inclusion per chunk, then aggregate to a splice-aware timeline.
  • Figure 2: Verification pipeline. Streaming decode $\rightarrow$ manifest retrieval $\rightarrow$ signature verification $\rightarrow$ fingerprint recomputation and inclusion verification $\rightarrow$ splice-aware timeline with interpretable reason codes.
  • Figure 3: Robustness under distribution transforms. Grouped bars show decode rate, WM-only verification rate, and MSv1 verification rate across eight transform conditions, ordered from mildest (left) to harshest (right). Error bars show Wilson 95% binomial confidence intervals ($n = 90$ chunks per condition). The WM-only channel survives clipping, resampling, bandpass, and moderate noise; MSv1 passes only for clipping (near-identity transform). The strict MFCC fingerprint makes MSv1 deliberately sensitive to transforms in this instantiation; a transform-tolerant fingerprint (e.g., SSL embeddings) is future work (Section \ref{sec:limitations}).
  • Figure 4: Quality--robustness trade-off. WM-only verification rate under additive noise (SNR = 20 dB) as a function of embedding quality (time-domain SNR), evaluated on the same 90-chunk subset as Figure 3. Each point uses an independent per-$\alpha$ enrollment (distinct CID and Merkle tree), so the $\alpha = 0.6$ rate may differ slightly from Figure 3 due to payload-dependent QIM bit patterns; both fall within each other's 95% CIs.
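
The error bars in Figure 3 are described as Wilson 95% binomial confidence intervals with $n = 90$ chunks per condition. As a rough sketch of how such an interval is computed, the snippet below implements the standard Wilson score formula. The function name `wilson_interval` and the example count (90 passes out of 90) are illustrative assumptions, not values taken from the paper's results.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical example: every one of 90 chunks verifies.
low, high = wilson_interval(90, 90)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the simple normal-approximation interval, the Wilson interval stays informative when the observed rate is 0% or 100%, which matters for conditions where a tier passes or fails on every chunk.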