Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning
Anh Tong, Thanh Nguyen-Tang, Dongeun Lee, Duc Nguyen, Toan Tran, David Hall, Cheongwoong Kang, Jaesik Choi
TL;DR
The paper introduces DiffEqFormer, a non-autonomous neural ODE transformer that parameterizes all attention and feed-forward weights as time-dependent functions, enabling continuous-depth computation and flexible fine-tuning across architectures. Spectral analysis of the QK and OV pairs reveals an increase in eigenvalue magnitude, which runs counter to the clustering observed in weight-sharing models, and Lyapunov exponents give an interpretable, token-level measure of sensitivity. Empirically, DiffEqFormer achieves competitive perplexities on WikiText-103 and OpenWebText, often surpassing the corresponding GPT and Llama baselines, and supports adaptive fine-tuning via solver-step adjustments and LoRA. The work offers a principled continuous-depth perspective with practical benefits for model adaptation, and it raises considerations around memory efficiency and the future integration of stochastic or more advanced ODE solvers.
Abstract
Recent advancements in large language models (LLMs) based on transformer architectures have sparked significant interest in understanding their inner workings. In this paper, we introduce a novel approach to modeling transformer architectures using highly flexible non-autonomous neural ordinary differential equations (ODEs). Our proposed model parameterizes all weights of attention and feed-forward blocks through neural networks, expressing these weights as functions of a continuous layer index. Through spectral analysis of the model's dynamics, we uncover an increase in eigenvalue magnitude that challenges the weight-sharing assumption prevalent in existing theoretical studies. We also leverage the Lyapunov exponent to examine token-level sensitivity, enhancing model interpretability. Our neural ODE transformer demonstrates performance comparable to or better than vanilla transformers across various configurations and datasets, while offering flexible fine-tuning capabilities that can adapt to different architectural constraints.
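To make the idea of weights as functions of a continuous layer index concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a tiny hypothetical hypernetwork (`make_time_dependent_weight`) that maps a scalar time t to each weight matrix, a single attention head without a causal mask or normalization, and a fixed-step forward Euler solver. All function names, sizes, and the choice of solver are illustrative assumptions.

```python
import numpy as np

def make_time_dependent_weight(rng, dim, hidden=16):
    """Return a function t -> (dim, dim) matrix (hypothetical hypernetwork)."""
    a = rng.normal(size=(hidden,))
    b = rng.normal(size=(hidden,))
    out = rng.normal(size=(hidden, dim * dim)) / np.sqrt(hidden * dim)

    def weight(t):
        features = np.tanh(a * t + b)           # time features, shape (hidden,)
        return (features @ out).reshape(dim, dim)

    return weight

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)       # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def vector_field(x, t, weights):
    """dx/dt = attention(x, t) + feed_forward(x, t); every weight depends on t."""
    wq, wk, wv, w1, w2 = (w(t) for w in weights)
    dim = x.shape[-1]
    scores = (x @ wq) @ (x @ wk).T / np.sqrt(dim)
    attn = softmax(scores) @ (x @ wv)           # single head, no causal mask
    ff = np.tanh(x @ w1) @ w2
    return attn + ff

def integrate(x0, weights, num_steps, t0=0.0, t1=1.0):
    """Fixed-step forward Euler over [t0, t1] (solver choice is an assumption)."""
    x = x0.copy()
    dt = (t1 - t0) / num_steps
    for k in range(num_steps):
        x = x + dt * vector_field(x, t0 + k * dt, weights)
    return x

rng = np.random.default_rng(0)
dim, tokens = 8, 5
weights = [make_time_dependent_weight(rng, dim) for _ in range(5)]
x0 = rng.normal(size=(tokens, dim))

# The same weight functions can be evaluated at different depths
# simply by changing the number of solver steps.
coarse = integrate(x0, weights, num_steps=12)
fine = integrate(x0, weights, num_steps=48)
```

Because the weights are functions of t rather than per-layer tensors, changing `num_steps` changes the effective depth without adding parameters. In this sketch, that property stands in for the solver-step adjustment mentioned in the TL;DR.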
