Neural Network Diffusion

Kai Wang; Dongwen Tang; Boya Zeng; Yida Yin; Zhaopan Xu; Yukun Zhou; Zelin Zang; Trevor Darrell; Zhuang Liu; Yang You

TL;DR

Neural Network Diffusion (p-diff) tackles the problem of generating high-performing neural network parameters without full gradient optimization. It uses a simple two-stage architecture: an autoencoder learns latent representations of parameter subsets, and a diffusion model synthesizes these latents from random noise; the decoder then maps them back into parameters. Empirically, p-diff matches or exceeds the performance of trained baselines across multiple datasets and architectures, while producing novel, non-memorized parameter configurations. The work demonstrates diffusion models’ versatility beyond image generation and suggests broader potential for parameter-space learning.

Abstract

Diffusion models have achieved remarkable success in image and video generation. In this work, we demonstrate that diffusion models can also generate high-performing neural network parameters. Our approach is simple, utilizing an autoencoder and a diffusion model. The autoencoder extracts latent representations of a subset of the trained neural network parameters. Next, a diffusion model is trained to synthesize these latent representations from random noise. This model then generates new representations, which are passed through the autoencoder's decoder to produce new subsets of high-performing network parameters. Across various architectures and datasets, our approach consistently generates models with comparable or improved performance over trained networks, with minimal additional cost. Notably, we empirically find that the generated models are not memorizing the trained ones. Our results encourage more exploration into the versatile use of diffusion models. Our code is available at https://github.com/NUS-HPC-AI-Lab/Neural-Network-Diffusion.
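The data flow described in the abstract (flatten a parameter subset, encode it, generate latents from noise, decode) can be sketched in a few lines of NumPy. This is a toy illustration under loud assumptions, not the authors' implementation: the "checkpoints" are synthetic vectors, a linear PCA projection stands in for the learned autoencoder, and a closed-form Gaussian noise predictor stands in for the trained denoising network. All names (`params`, `encode`, `decode`, `predict_noise`, `sample_latents`) and all hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for K saved checkpoints: each row is one flattened
# parameter subset (for example BN weights and biases). Synthetic data only.
K, D, LATENT = 200, 64, 8
base = rng.normal(size=D)
basis = rng.normal(size=(LATENT, D))
params = base + 0.05 * rng.normal(size=(K, LATENT)) @ basis  # shape (K, D)

# Stage 1: parameter "autoencoder".
# Assumption: linear PCA replaces the learned encoder/decoder.
mean = params.mean(axis=0)
_, _, vt = np.linalg.svd(params - mean, full_matrices=False)
components = vt[:LATENT]  # shape (LATENT, D)

def encode(p):
    """Map flattened parameters to latent vectors."""
    return (p - mean) @ components.T

def decode(z):
    """Map latent vectors back to flattened parameters."""
    return z @ components + mean

latents = encode(params)

# Stage 2: diffusion in latent space with a standard linear beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
mu, var = latents.mean(axis=0), latents.var(axis=0)

def predict_noise(z_t, t):
    """Assumption: analytic E[eps | z_t] for a diagonal Gaussian fit to the
    latents. In practice a trained denoising network would replace this."""
    ab = alpha_bars[t]
    return np.sqrt(1.0 - ab) * (z_t - np.sqrt(ab) * mu) / (ab * var + 1.0 - ab)

def sample_latents(n):
    """Ancestral sampling: start from Gaussian noise and denoise step by step."""
    z = rng.normal(size=(n, LATENT))
    for t in reversed(range(T)):
        eps_hat = predict_noise(z, t)
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added on the final step
            z = z + np.sqrt(betas[t]) * rng.normal(size=z.shape)
    return z

# Inference: noise -> latents -> decoder -> new parameter subsets.
generated = decode(sample_latents(16))
```

In a real setting, each row of `generated` would be reshaped and loaded back into the corresponding layers of the network before evaluation. The sketch omits the parameter and latent noise augmentation mentioned in the Figure 5 caption.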

Paper Structure (56 sections, 7 equations, 5 figures, 13 tables)

This paper contains 56 sections, 7 equations, 5 figures, 13 tables.

Introduction
Neural Network Diffusion
Preliminaries of Diffusion Models
Forward process.
Reverse process.
Training and inference.
Overview
Parameter Autoencoder
Data preparation.
Training.
Parameter Generation
Design space.
Experiments
Setup
Datasets and architectures.
...and 41 more sections

Figures (5)

Figure 1: Top illustrates the standard diffusion process in image generation. Bottom shows the parameter heatmap of the batch normalization (BN) layer at various stages of ResNet-18 training on CIFAR-100. In the heatmap, the upper half is BN weights, while the lower half is BN biases. Color corresponds to parameter value.
Figure 2: Our approach consists of two processes: parameter autoencoder and parameter generation. Parameter autoencoder aims to extract the latent representations and reconstruct model parameters via the decoder. The extracted representations are used to train a diffusion model (DM). During inference, a random noise vector is fed into the DM and the trained decoder to generate new parameters.
Figure 3: p-diff generalizes to under-trained original models. For original models trained for 1 or 3 epochs before the fine-tuning steps, the generated models can still achieve high accuracy with low similarity to original models. Results are on ResNet-18 and CIFAR-100.
Figure 4: Diversity of original models is important for the novelty of generated models. We use learning rates of 0.03, 0.003, and 0.3 for fine-tuning and saving the normalization layers of the converged model as training samples. Higher learning rate leads to greater original model diversity and lower maximum similarity of generated models. Results are on ResNet-18 and the CIFAR-100 dataset.
Figure 5: (a) shows the impact of parameter and latent noise augmentation on the novelty and accuracy of generated models. Using both parameter and latent noise augmentation achieves the highest accuracy. (b) illustrates the accuracy trajectories across different diffusion steps during inference. Our approach generates high-performing parameters through diverse paths. (c) presents the distributions of maximum similarity, where thickness indicates density. A large set of original models leads to higher novelty in generated models.
