Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation

Jaechul Roh, Zachary Novack, Yuefeng Peng, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Amir Houmansadr

TL;DR

The paper addresses copyright leakage in lyrics-conditioned and multimodal generation by showing that phonetic structure can trigger memorization even when semantics are altered. It introduces Adversarial PhoneTic Prompting (APT) and Adversarial VerbaTim Prompting (AVT) and defines a CMUdict-based phonetic similarity metric $\Phi$ used to craft high-$\Phi$ prompts. Experiments with SUNO, YuE, and Veo 3 demonstrate that phoneme-preserving prompts yield outputs with strong melodic, rhythmic, and even visual fidelity to the originals, revealing cross-modal memorization rooted in acoustic patterns. These findings expose a critical vulnerability in transcript-conditioned systems and motivate new evaluation and safety frameworks that account for phonetic and multimodal leakage beyond verbatim text filters.

Abstract

Generative AI systems for music and video commonly use text-based filters to prevent the regurgitation of copyrighted material. We expose a fundamental flaw in this approach by introducing Adversarial PhoneTic Prompting (APT), a novel attack that bypasses these safeguards by exploiting phonetic memorization. The APT attack replaces iconic lyrics with homophonic but semantically unrelated alternatives (e.g., "mom's spaghetti" becomes "Bob's confetti"), preserving acoustic structure while altering meaning; we identify high-fidelity phonetic matches using the CMU Pronouncing Dictionary. We demonstrate that leading Lyrics-to-Song (L2S) models like SUNO and YuE regenerate songs with striking melodic and rhythmic similarity to their copyrighted originals when prompted with these altered lyrics. More surprisingly, this vulnerability extends across modalities. When prompted with phonetically modified lyrics from a song, a Text-to-Video (T2V) model like Veo 3 reconstructs visual scenes from the original music video, including specific settings and character archetypes, despite the absence of any visual cues in the prompt. Our findings reveal that models memorize deep, structural patterns tied to acoustics, not just verbatim text. This phonetic-to-visual leakage represents a critical vulnerability in transcript-conditioned generative models, rendering simple copyright filters ineffective and raising urgent concerns about the secure deployment of multimodal AI systems. Demo examples are available at our project page (https://jrohsc.github.io/music_attack/).
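
To make the idea of a CMUdict-based phonetic comparison concrete, the following is a minimal sketch and not the paper's definition of $\Phi$, which is not given in this summary. It assumes a normalized edit-distance similarity over ARPAbet phoneme sequences, plus an optional vowel-only view that approximates rhyme and syllable structure. The names `PRONUNCIATIONS`, `to_phonemes`, `edit_distance`, and `phonetic_similarity` are hypothetical, and the transcriptions are hand-copied in CMUdict style and should be checked against the dictionary itself.

```python
from typing import Dict, List

# Assumption: hand-copied ARPAbet transcriptions in CMUdict style.
# Verify against the CMU Pronouncing Dictionary before relying on them.
PRONUNCIATIONS: Dict[str, List[str]] = {
    "mom's": ["M", "AA1", "M", "Z"],
    "spaghetti": ["S", "P", "AH0", "G", "EH1", "T", "IY0"],
    "bob's": ["B", "AA1", "B", "Z"],
    "confetti": ["K", "AH0", "N", "F", "EH1", "T", "IY0"],
}


def to_phonemes(phrase: str, vowels_only: bool = False) -> List[str]:
    """Map a phrase to a flat phoneme list using the lookup table above."""
    phonemes: List[str] = []
    for word in phrase.lower().split():
        phonemes.extend(PRONUNCIATIONS[word])
    if vowels_only:
        # In ARPAbet, only vowels carry a trailing stress digit (0, 1, or 2).
        phonemes = [p for p in phonemes if p[-1].isdigit()]
    return phonemes


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phonetic_similarity(a: str, b: str, vowels_only: bool = False) -> float:
    """Illustrative similarity in [0, 1]; an assumption, not the paper's metric."""
    pa = to_phonemes(a, vowels_only)
    pb = to_phonemes(b, vowels_only)
    longest = max(len(pa), len(pb))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pa, pb) / longest


if __name__ == "__main__":
    original, variant = "mom's spaghetti", "Bob's confetti"
    print(f"all phonemes: {phonetic_similarity(original, variant):.2f}")
    print(f"vowels only:  {phonetic_similarity(original, variant, vowels_only=True):.2f}")
```

Under these assumed transcriptions, the two phrases share an identical stressed-vowel sequence even though several consonants differ, which is one plausible reading of "preserving acoustic structure while altering meaning". A real pipeline would load the full dictionary and handle out-of-vocabulary words, neither of which this sketch attempts.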

Paper Structure

This paper contains 38 sections, 2 equations, 23 figures, 8 tables.

Figures (23)

  • Figure 1: Adversarial PhoneTic Prompting (APT). We modify the lyrics of “Lose Yourself” by preserving phonetic rhythm and rhyme while altering semantics (e.g., “mom's spaghetti”→“Bob's confetti”, “vomit”→“yogurt”). Despite these changes, SUNO generates a song that remains strongly aligned with the original training instance.
  • Figure 2: Phoneme-modified variant of Eminem’s “Lose Yourself” with altered lines highlighted in red. The distortion preserves flow while revealing vulnerabilities in L2S models.
  • Figure 3: AudioJudge Similarity Heatmaps. We evaluate pairwise melody and rhythm similarity between original and generated songs using AudioJudge across the following categories: (1) Mandarin, (2) Cantonese, and (3) other English songs. Each heatmap cell shows the overall similarity score (0–100) between an original and a generated song. Green indicates high similarity (80–100), yellow moderate similarity (40–79), and red low similarity (0–39). Diagonal cells reflect self-pairing scores (i.e., an original paired with phoneme-modified versions of the same song). The distribution of scores confirms that AudioJudge does not assign uniformly high scores across all comparisons but discriminates meaningfully based on melodic and rhythmic correspondence. This supports its reliability as an evaluative tool for music generation similarity.
  • Figure 5: Comparative breakdown of Kendrick Lamar’s “DNA” and a rap-styled variant across five musical dimensions, showing strong similarity in rhythm and vocal identity.
  • Figure 6: Distribution of human similarity ratings collected in our listening study. Participants rated the musical similarity between generated and original audio samples on a 5-point Likert scale, across three languages (Mandarin, Cantonese, English) and two prompt types: strong (exact-match lyrics) and weak (semantic paraphrases). Strong prompts consistently received higher ratings, indicating that lexical fidelity strongly correlates with perceived musical similarity.
  • ...and 18 more figures