
Exploring compressibility of transformer based text-to-music (TTM) models

Vasileios Moschopoulos, Thanasis Kotsiopoulos, Pablo Peso Parada, Konstantinos Nikiforidis, Alexandros Stergiadis, Gerasimos Papakostas, Md Asif Jalal, Jisi Zhang, Anastasios Drosou, Karthikeyan Saravanan

TL;DR

This work tackles the challenge of deploying text-to-music models on resource-constrained devices by compressing transformer-based components via knowledge distillation and targeted training. It introduces TinyTTM, an 89.2M-parameter system built from a MusicGen-Small teacher and optimized across encoder, LM, and decoder using MusicBench with dynamic loss weighting and weight transferring. TinyTTM achieves FAD $=3.66$ and KL $=1.32$, outperforming MusicGen-Small on these metrics but not matching the fine-tuned teacher, while delivering substantial parameter and latency reductions. The study provides practical KD strategies for each module and demonstrates a viable path toward on-device TTM with comparable quality and significant efficiency gains.

Abstract

State-of-the-art Text-To-Music (TTM) generative AI models are large and require desktop- or server-class compute, making them infeasible for deployment on mobile phones. This paper presents an analysis of the trade-offs between model compression and generation performance of TTM models. We study compression through knowledge distillation and specific modifications that enable its applicability across the components of the TTM model (the encoder, the generative model, and the decoder). Leveraging these methods, we create TinyTTM (89.2M params), which achieves a FAD of 3.66 and a KL of 1.32 on the MusicBench dataset, better than MusicGen-Small (557.6M params) but not lower than MusicGen-Small fine-tuned on MusicBench.

Paper Structure (16 sections, 11 equations, 1 figure, 6 tables, 2 algorithms)