Secure & Personalized Music-to-Video Generation via CHARCHA

Mehul Agarwal, Gauri Agarwal, Santiago Benoit, Andrew Lippman, Jean Oh

TL;DR

This work introduces MVP, a fully automated pipeline for personalized music video generation that integrates audio transcription, emotion recognition, and language-conditioned diffusion to create visuals synchronized with a music track. A key innovation is CHARCHA, a facial identity verification protocol that enables authorized user likeness to be embedded in videos while mitigating impersonation risks, alongside LoRA-based personalization trained on user-provided images. The approach uses zero-shot rhythm/lyric extraction, LLM-driven prompt generation, and on-set spherical interpolation to maintain narrative and emotional coherence, with style control via fine-tuned diffusion checkpoints. Collectively, the framework demonstrates secure, interactive music video creation with personalized avatars, highlighting ethical considerations and practical implications for user-centric AI-generated media.
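The spherical interpolation mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `slerp`, the plain-list vector representation, and the two-dimensional "serene" and "melancholy" vectors are hypothetical stand-ins for whatever embeddings or latents the pipeline actually blends.

```python
import math

def slerp(v0, v1, t, eps=1e-7):
    """Spherically interpolate between vectors v0 and v1 at fraction t in [0, 1]."""
    # Angle between the two vectors, from the normalized dot product.
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    # Nearly parallel or antiparallel vectors: fall back to linear interpolation.
    if sin_omega < eps:
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]

# Hypothetical placeholder vectors, not real emotion embeddings.
serene = [1.0, 0.0]
melancholy = [0.0, 1.0]
halfway = slerp(serene, melancholy, 0.5)
```

Unlike a straight linear blend, the intermediate vector stays on the arc between the two endpoints, so for unit-length inputs its norm remains 1 throughout the transition.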

Abstract

Music is a deeply personal experience, and our aim is to enhance it with a fully automated pipeline for personalized music video generation. Our work allows listeners to be not just consumers but co-creators in the music video generation process by creating personalized, consistent, and context-driven visuals based on the lyrics, rhythm, and emotion in the music. The pipeline combines multimodal translation and generation techniques and applies low-rank adaptation to listeners' images to create immersive music videos that reflect both the music and the individual. To ensure the ethical use of users' identities, we also introduce CHARCHA (patent pending), a facial identity verification protocol that protects people against unauthorized use of their face while collecting authorized images from users for personalizing their videos. This paper thus provides a secure and innovative framework for creating deeply personalized music videos.

Paper Structure

This paper contains 17 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Image stills and lyrics from generated music videos for Rick Astley's "Never Gonna Give You Up," with character reference from CHARCHA. The videos use the Queratogray Sketch, Western Animation Diffusion, and Realistic Vision V5.1 checkpoint models
  • Figure 2: Image generation based on the lyric "I just wanna tell you how I'm feeling", progressively incorporating LLM conditioning, negative prompting, style prompting, and emotion prompting
  • Figure 3: Left: valence/arousal emotion spectrum. Right: serene-melancholy spherical interpolation
  • Figure 4: 7 CHARCHA Protocol Actions & their backend detection using MediaPipe
  • Figure 5: Survey of CHARCHA experiment with n=16 participants
  • ...and 5 more figures