Text-to-Model: Text-Conditioned Neural Network Diffusion for Train-Once-for-All Personalization

Zexi Li; Lingzhi Gao; Chao Wu

TL;DR

The paper tackles the challenge of train-once-for-all personalization by asking whether GenAI can generate personalized model parameters from natural-language prompts. It introduces Tina, a text-conditioned diffusion transformer guided by CLIP embeddings to produce task-specific model parameters from prompts, and demonstrates strong in-distribution and out-of-distribution generalization even with limited data. Key contributions include an end-to-end text-to-model framework, support for image prompts via CLIP, a classification-sequence padding mechanism for variable task sizes, and comprehensive analyses of scaling, prompt modalities, and unseen classes. The approach outperforms strong baselines across multiple datasets and scales to larger backbones like ViT-B/32, highlighting a promising path toward scalable, user-driven personalization in GenAI.

Abstract

Generative artificial intelligence (GenAI) has made significant progress in understanding world knowledge and generating content from human language across various modalities, such as text-to-text large language models, text-to-image Stable Diffusion, and text-to-video Sora. In this paper, we investigate the capability of GenAI for text-to-model generation, to see whether GenAI can comprehend the hyper-level knowledge embedded within the parameters of AI models themselves. Specifically, we study a practical scenario termed train-once-for-all personalization, which aims to generate personalized models for diverse end-users and tasks using text prompts. Inspired by the recent emergence of neural network diffusion, we present Tina, a text-conditioned neural network diffusion model for train-once-for-all personalization. Tina leverages a diffusion transformer model conditioned on task descriptions embedded using a CLIP model. Despite the astronomical number of potential personalized tasks (e.g., $1.73\times10^{13}$), by our design, Tina demonstrates remarkable in-distribution and out-of-distribution generalization even when trained on small datasets ($\sim 1000$). We further verify whether and how Tina understands world knowledge by analyzing its capabilities under zero-shot/few-shot image prompts, different numbers of personalized classes, natural-language description prompts, and prediction of unseen entities.
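To give a feel for the scale of the task space quoted in the abstract, the short sketch below counts class subsets. It assumes, purely for illustration, that a personalized task is a choice of 10 classes out of a 100-class dataset; the text above does not state how the figure of $1.73\times10^{13}$ was obtained, so the values of `num_classes` and `classes_per_task` are assumptions.

```python
import math

# Hypothetical setup (an assumption, not stated in the text above):
# a personalized task is an unordered subset of classes from a larger label set.
num_classes = 100       # assumed size of the full label set
classes_per_task = 10   # assumed number of classes per personalized model

# Number of distinct unordered class subsets: "100 choose 10".
num_tasks = math.comb(num_classes, classes_per_task)

print(f"{num_tasks:.2e}")
```

Under these assumptions the count has the same order of magnitude as the figure in the abstract, which illustrates why enumerating and training one model per task is impractical and why generalizing from roughly 1000 training tasks is notable.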

Paper Structure (33 sections, 1 equation, 7 figures, 7 tables, 1 algorithm)

Introduction
Methodology
Problem Setup
Definition of Setup
Strong Baselines: Classifier Selection and TAPER
Dataset Preparation and Description
Proposed Tina: Text-conditioned Neural Network Diffusion Model
Framework Overview
Architecture and Training Objective
Design Details
Experiments
Experimental Setups
Results under Different Datasets
In-depth Analysis of Tina
Conclusion
...and 18 more sections

Figures (7)

Figure 1: Demonstration of train-once-for-all personalization scenario. Users have text descriptions of the desired personalized models.
Figure 2: Description of the training and testing data for Tina. p-Model is short for personalized models. The blue blocks are for training, and the green blocks are for testing.
Figure 3: Framework overview of Tina. (a) Training stage. The p-Models are first augmented by our classifier augmentation strategy and then noised according to the diffusion step. The p-Models are tokenized into chunks of vectors, and classification sequence padding is optionally applied if the classification length is shorter than the default. The CLIP text encoder is used to encode the users' text prompts during training. (b) Testing stage. Random noise is tokenized and denoised into the parameters of p-Models. Thanks to the vision-language alignment of CLIP, Tina accepts both text and visual prompts as diffusion conditions.
Figure 4: Tina capability analysis w.r.t. different parameterization and training schemes. (a) Scaling the parameters of DiT in Tina. CNN-5K (14K) means the p-Model is a CNN with 5K (14K) parameters. From 152M (hidden size 32) to 789M (hidden size 2048), scaling helps in the emergence of intelligence. (b) Parameter inheritance from pretrained G.pt helps speed up training in the early stage. (c) Training Tina with image-prompted data versus text-prompted data. Text-prompted training converges faster.
Figure 5: Scaling the input dimensions and training data for Tina.
...and 2 more figures
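
The tokenization and padding steps named in the Figure 3 caption can be pictured with a small sketch. This is not the authors' implementation: the helper names `pad_classifier` and `tokenize`, the use of zero rows for padding, and the chunk size are all hypothetical choices made only to illustrate the idea of bringing a shorter classification head up to a default length and splitting flattened parameters into fixed-size chunks.

```python
from typing import List


def pad_classifier(head: List[List[float]], max_classes: int) -> List[List[float]]:
    """Pad a classifier head (one weight row per class) with zero rows.

    Zero-row padding is an assumption for illustration; the actual
    padding scheme may differ.
    """
    if len(head) > max_classes:
        raise ValueError("head has more classes than max_classes")
    width = len(head[0]) if head else 0
    return head + [[0.0] * width for _ in range(max_classes - len(head))]


def tokenize(flat: List[float], chunk_size: int) -> List[List[float]]:
    """Split a flat parameter vector into fixed-size chunks, zero-padding the tail."""
    remainder = len(flat) % chunk_size
    if remainder:
        flat = flat + [0.0] * (chunk_size - remainder)
    return [flat[i:i + chunk_size] for i in range(0, len(flat), chunk_size)]


# Toy example: a 3-class head with 4 features per class, padded to 5 classes.
head = [[0.1 * (r + c) for c in range(4)] for r in range(3)]
padded = pad_classifier(head, max_classes=5)

# Flatten row by row, then cut into tokens of 8 values each.
flat = [w for row in padded for w in row]
tokens = tokenize(flat, chunk_size=8)

print(len(padded), len(tokens))
```

In this toy setting every personalized model, regardless of how many classes it has, ends up as a token sequence of the same shape, which is what lets a single transformer process tasks of different sizes.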
