Device-Guided Music Transfer

Manh Pham Hung, Changshuo Hu, Ting Dang, Dong Ma

TL;DR

DeMT addresses device-dependent music perception by using a vision-language model (VLM) to learn speaker-specific embeddings from frequency response (FR) curves rendered as images, then conditioning a FiLM-enabled Demucs transformer on those embeddings to transfer music playback toward the target device's characteristics. Training uses a self-collected six-device dataset with controlled FR capture, 5-second audio segments, and 20 epochs of optimization. Results show strong device transfer performance and meaningful few-shot generalization to unseen devices, supported by ablations that highlight the importance of VLM-FiLM conditioning and Harman-curve context. The approach offers practical pathways for device-aware audio augmentation and quality enhancement across diverse playback hardware.

Abstract

Device-guided music transfer adapts playback across unseen devices for users who lack them. Existing methods mainly focus on modifying the timbre, rhythm, harmony, or instrumentation to mimic genres or artists, overlooking the diverse hardware properties of the playback device (i.e., speaker). Therefore, we propose DeMT, which processes a speaker's frequency response curve as a line graph using a vision-language model to extract device embeddings. These embeddings then condition a hybrid transformer via feature-wise linear modulation. Fine-tuned on a self-collected dataset, DeMT enables effective speaker-style transfer and robust few-shot adaptation for unseen devices, supporting applications like device-style augmentation and quality enhancement.
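
The conditioning step named in the abstract, feature-wise linear modulation (FiLM), can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name film, the array sizes, and the randomly initialized projection weights are all hypothetical, chosen only to show how a device embedding could produce a per-channel scale and shift that is applied to intermediate audio features.

```python
import numpy as np

def film(features, embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(e) * features + beta(e).

    features:  (channels, frames) intermediate audio features
    embedding: (embed_dim,) device embedding (here a stand-in vector)
    """
    gamma = embedding @ w_gamma + b_gamma  # per-channel scale, shape (channels,)
    beta = embedding @ w_beta + b_beta     # per-channel shift, shape (channels,)
    # Broadcast the per-channel scale and shift across the time axis.
    return gamma[:, None] * features + beta[:, None]

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
embed_dim, channels, frames = 8, 4, 16
features = rng.standard_normal((channels, frames))
embedding = rng.standard_normal(embed_dim)
w_gamma = rng.standard_normal((embed_dim, channels)) * 0.1
b_gamma = np.ones(channels)   # scale starts near 1
w_beta = rng.standard_normal((embed_dim, channels)) * 0.1
b_beta = np.zeros(channels)   # shift starts near 0

out = film(features, embedding, w_gamma, b_gamma, w_beta, b_beta)
```

With the biases chosen this way, an all-zero embedding leaves the features unchanged, so in this sketch the modulation acts as a device-specific deviation from an identity mapping.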

Paper Structure

This paper contains 11 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Illustration of DeMT. Here, EP represents the embedding pool.
  • Figure 2: Data collection.
  • Figure 3: FRCs of the six speakers included in the study.
  • Figure 4: t-SNE visualization of the device embeddings.
  • Figure 5: Examples of input, target, and output audio spectrograms across six audio speakers.