Exploring compressibility of transformer based text-to-music (TTM) models

Vasileios Moschopoulos; Thanasis Kotsiopoulos; Pablo Peso Parada; Konstantinos Nikiforidis; Alexandros Stergiadis; Gerasimos Papakostas; Md Asif Jalal; Jisi Zhang; Anastasios Drosou; Karthikeyan Saravanan


TL;DR

This work tackles the challenge of deploying text-to-music models on resource-constrained devices by compressing transformer-based components via knowledge distillation and targeted training. It introduces TinyTTM, an 89.2M-parameter system built from a MusicGen-Small teacher (557.6M parameters) and optimized across the encoder, LM, and decoder on MusicBench, using dynamic loss weight scheduling and weight transferring. TinyTTM achieves FAD $=3.66$ and KL $=1.32$ (lower is better for both), outperforming MusicGen-Small on these metrics but not matching the fine-tuned teacher, while substantially reducing parameter count and latency. The study provides practical KD strategies for each module and demonstrates a viable path toward on-device TTM with comparable quality and significant efficiency gains.

Abstract

State-of-the-art Text-To-Music (TTM) generative AI models are large and require desktop- or server-class compute, making them infeasible to deploy on mobile phones. This paper presents an analysis of the trade-offs between model compression and generation performance of TTM models. We study compression through knowledge distillation, together with specific modifications that make it applicable to each component of the TTM model (the encoder, the generative model, and the decoder). Leveraging these methods, we create TinyTTM (89.2M params), which achieves a FAD of 3.66 and a KL of 1.32 on the MusicBench dataset: better than MusicGen-Small (557.6M params), but not better than MusicGen-Small fine-tuned on MusicBench.
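To make the knowledge distillation idea in the abstract concrete, here is a minimal sketch of a generic logit-distillation loss, in which a student is trained to match a teacher's temperature-softened output distribution. It is not taken from the paper: the function names, the temperature, and the fixed mixing weight `alpha` are illustrative assumptions, and the paper's dynamic loss weight scheduling is not reproduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic formulation; the temperature value is an assumption,
    not a setting reported in the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q)
             for p, q in zip(p_teacher, p_student) if p > 0)
    return kl * temperature ** 2

def combined_loss(ce_loss, kd, alpha):
    """Mix the ground-truth loss and the distillation loss.

    `alpha` is a hypothetical fixed weight; a schedule would vary it
    over training steps.
    """
    return alpha * kd + (1.0 - alpha) * ce_loss

# Toy logits over a 3-token vocabulary (made-up numbers).
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
loss = combined_loss(ce_loss=1.0, kd=kd_loss(student, teacher), alpha=0.5)
```

The distillation term is zero when the student reproduces the teacher's logits exactly and grows as the two distributions diverge, which is what lets a much smaller student inherit behaviour from the larger teacher.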


Paper Structure (16 sections, 11 equations, 1 figure, 6 tables, 2 algorithms)


Introduction
TinyTTM
Encoder
Generative Transformer Language Model (LM)
Weight Transferring
Dynamic Loss Weight Scheduling
Decoder
Evaluation setup
Experimental Analyses
Encoder analysis
Transformer Language Model analysis
Teacher LM Model Fine-Tuning
Student LM Knowledge Distillation
Decoder analysis
TinyTTM performance
...and 1 more sections

Figures (1)

Figure 1: MusicGen-Small vs. proposed TinyTTM.
