Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation
Jaechul Roh, Zachary Novack, Yuefeng Peng, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Amir Houmansadr
TL;DR
The paper addresses copyright leakage in lyrics-conditioned and multimodal generation by showing that phonetic structure can trigger memorization even when semantics are altered. It introduces Adversarial PhoneTic Prompting (APT) and Adversarial VerbaTim Prompting (AVT), and defines a CMUdict-based phonetic similarity metric $\Phi$ used to craft prompts with high $\Phi$ relative to the original lyrics. Experiments with SUNO, YuE, and Veo 3 demonstrate that phoneme-preserving prompts yield outputs with strong melodic, rhythmic, and even visual fidelity to the originals, revealing cross-modal memorization rooted in acoustic patterns. These findings expose a critical vulnerability in transcript-conditioned systems and motivate new evaluation and safety frameworks that account for phonetic and multimodal leakage, which verbatim text filters do not catch.
Abstract
Generative AI systems for music and video commonly use text-based filters to prevent the regurgitation of copyrighted material. We expose a fundamental flaw in this approach by introducing Adversarial PhoneTic Prompting (APT), a novel attack that bypasses these safeguards by exploiting phonetic memorization. The APT attack replaces iconic lyrics with homophonic but semantically unrelated alternatives (e.g., "mom's spaghetti" becomes "Bob's confetti"), preserving acoustic structure while altering meaning; we identify high-fidelity phonetic matches using the CMU Pronouncing Dictionary. We demonstrate that leading Lyrics-to-Song (L2S) models like SUNO and YuE regenerate songs with striking melodic and rhythmic similarity to their copyrighted originals when prompted with these altered lyrics. More surprisingly, this vulnerability extends across modalities. When prompted with phonetically modified lyrics from a song, a Text-to-Video (T2V) model like Veo 3 reconstructs visual scenes from the original music video, including specific settings and character archetypes, despite the absence of any visual cues in the prompt. Our findings reveal that models memorize deep, structural patterns tied to acoustics, not just verbatim text. This phonetic-to-visual leakage represents a critical vulnerability in transcript-conditioned generative models, rendering simple copyright filters ineffective and raising urgent concerns about the secure deployment of multimodal AI systems. Demo examples are available at our project page (https://jrohsc.github.io/music_attack/).
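To make the idea of phoneme-level matching concrete, the following is a minimal sketch of one way to score how closely two lyric fragments align phonetically. It is not the paper's metric $\Phi$, whose exact definition is not given in this section; it simply assumes a normalized edit distance over ARPAbet phoneme sequences with stress markers removed. The pronunciation table is hand-entered in CMUdict style for illustration and should be checked against the actual dictionary, and the helper names (`to_phonemes`, `edit_distance`, `phoneme_similarity`) are hypothetical.

```python
import re

# Hand-entered ARPAbet pronunciations in CMUdict style (an illustrative
# assumption; a real pipeline would load the full CMU Pronouncing Dictionary).
PRONUNCIATIONS = {
    "mom's": ["M", "AA1", "M", "Z"],
    "spaghetti": ["S", "P", "AH0", "G", "EH1", "T", "IY0"],
    "bob's": ["B", "AA1", "B", "Z"],
    "confetti": ["K", "AH0", "N", "F", "EH1", "T", "IY0"],
}


def to_phonemes(phrase):
    """Map a phrase to a flat phoneme list, dropping stress digits."""
    phonemes = []
    for word in phrase.lower().split():
        for phoneme in PRONUNCIATIONS[word]:
            phonemes.append(re.sub(r"\d", "", phoneme))
    return phonemes


def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_similarity(phrase_a, phrase_b):
    """Score in [0, 1]; 1.0 means identical phoneme sequences.

    Assumed stand-in for a phonetic similarity score, not the paper's Phi.
    """
    a, b = to_phonemes(phrase_a), to_phonemes(phrase_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(phoneme_similarity("mom's spaghetti", "Bob's confetti"))
```

A plain edit distance treats every phoneme substitution as equally costly, so it likely understates how similar rhyming phrases sound. A score intended to capture what a listener or a model perceives would presumably weight vowels, stress, and syllable count more heavily.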
