Secure & Personalized Music-to-Video Generation via CHARCHA
Mehul Agarwal, Gauri Agarwal, Santiago Benoit, Andrew Lippman, Jean Oh
TL;DR
This work introduces MVP, a fully automated pipeline for personalized music video generation that integrates audio transcription, emotion recognition, and language-conditioned diffusion to create visuals synchronized with a music track. A key innovation is CHARCHA, a facial identity verification protocol that enables authorized user likeness to be embedded in videos while mitigating impersonation risks, alongside LoRA-based personalization trained on user-provided images. The approach uses zero-shot rhythm/lyric extraction, LLM-driven prompt generation, and on-set spherical interpolation to maintain narrative and emotional coherence, with style control via fine-tuned diffusion checkpoints. Collectively, the framework demonstrates secure, interactive music video creation with personalized avatars, highlighting ethical considerations and practical implications for user-centric AI-generated media.
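The summary above mentions spherical interpolation without giving details. As a minimal sketch of the general technique (not necessarily the authors' implementation), the code below interpolates between two latent vectors along the arc between them instead of along a straight line, which keeps intermediate vectors at a comparable norm. The function names `slerp` and `transition_frames`, and the representation of latents as plain lists of floats, are assumptions made for illustration.

```python
import math


def slerp(v0, v1, t, eps=1e-7):
    """Spherical linear interpolation between two non-zero vectors.

    t = 0 returns v0, t = 1 returns v1. Vectors are plain lists of
    floats here (an illustrative assumption, not the paper's format).
    """
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # Cosine of the angle between the two directions, clamped for safety.
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))
    omega = math.acos(dot)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel (or anti-parallel) vectors: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def transition_frames(start, end, n_frames):
    """Return n_frames latents moving from start to end, endpoints included."""
    if n_frames < 2:
        raise ValueError("n_frames must be at least 2")
    return [slerp(start, end, i / (n_frames - 1)) for i in range(n_frames)]
```

For example, `transition_frames(a, b, 5)` would yield five latents from `a` to `b` that could each be decoded into one frame of a transition; how the frame count is tied to the music's timing is left open here.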
Abstract
Music is a deeply personal experience, and our aim is to enhance it with a fully automated pipeline for personalized music video generation. Our work allows listeners to be not just consumers but co-creators in the music video generation process by creating personalized, consistent, and context-driven visuals based on the lyrics, rhythm, and emotion in the music. The pipeline combines multimodal translation and generation techniques and uses low-rank adaptation on listeners' images to create immersive music videos that reflect both the music and the individual. To ensure the ethical use of users' identities, we also introduce CHARCHA (patent pending), a facial identity verification protocol that protects people against unauthorized use of their face while collecting authorized images from users for personalizing their videos. This paper thus provides a secure and innovative framework for creating deeply personalized music videos.
