Text-to-Model: Text-Conditioned Neural Network Diffusion for Train-Once-for-All Personalization
Zexi Li, Lingzhi Gao, Chao Wu
TL;DR
The paper tackles the challenge of train-once-for-all personalization by asking whether GenAI can generate personalized model parameters from natural-language prompts. It introduces Tina, a text-conditioned diffusion transformer guided by CLIP embeddings to produce task-specific model parameters from prompts, and demonstrates strong in-distribution and out-of-distribution generalization even with limited data. Key contributions include an end-to-end text-to-model framework, support for image prompts via CLIP, a classification-sequence padding mechanism for variable task sizes, and comprehensive analyses of scaling, prompt modalities, and unseen classes. The approach outperforms strong baselines across multiple datasets and scales to larger backbones like ViT-B/32, highlighting a promising path toward scalable, user-driven personalization in GenAI.
Abstract
Generative artificial intelligence (GenAI) has made significant progress in understanding world knowledge and generating content from human language across various modalities, such as text-to-text large language models, text-to-image Stable Diffusion, and text-to-video Sora. In this paper, we investigate the capability of GenAI for text-to-model generation, to see whether GenAI can comprehend the hyper-level knowledge embedded within the parameters of AI models themselves. Specifically, we study a practical scenario termed train-once-for-all personalization, which aims to generate personalized models for diverse end-users and tasks using text prompts. Inspired by the recent emergence of neural network diffusion, we present Tina, a text-conditioned neural network diffusion model for train-once-for-all personalization. Tina leverages a diffusion transformer conditioned on task descriptions embedded using a CLIP model. Despite the astronomical number of potential personalized tasks (e.g., $1.73\times10^{13}$), by design, Tina demonstrates remarkable in-distribution and out-of-distribution generalization even when trained on small datasets ($\sim 1000$). We further verify whether and how Tina understands world knowledge by analyzing its capabilities under zero-shot/few-shot image prompts, different numbers of personalized classes, natural-language description prompts, and the prediction of unseen entities.
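To give a sense of where a task count on the order of $1.73\times10^{13}$ could come from, the following minimal sketch counts the distinct class subsets a user might request. The setting used here (each personalized task selects 10 classes out of a pool of 100) is an assumption made for illustration and is not stated in this abstract; the function name `count_personalized_tasks` is likewise hypothetical.

```python
import math


def count_personalized_tasks(num_classes: int, classes_per_task: int) -> int:
    """Number of distinct personalized tasks, where a task is an unordered
    subset of `classes_per_task` classes drawn from `num_classes` classes."""
    return math.comb(num_classes, classes_per_task)


# Assumed setting (not taken from the abstract): 10 classes chosen from 100.
count = count_personalized_tasks(100, 10)
print(f"{count:.2e}")  # prints 1.73e+13
```

Under this assumed setting, a training set of roughly 1000 tasks would cover only a tiny fraction of the possible subsets, which is why generalization to unseen tasks matters.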
