ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation
Carlos Boned Riera, David Romero Sanchez, Oriol Ramos Terrades
TL;DR
ODE-ViT reframes Vision Transformers as continuous-depth models by treating the attention mechanism as an autonomous ODE, enabling stable, well-posed dynamics and improved interpretability with fewer parameters. The authors enforce local Lipschitz continuity and use center normalization, and a Lie–Trotter-based decomposition splits the attention block into learnable subflows. They also introduce a plug-and-play teacher–student framework in which a pretrained discrete ViT guides the ODE trajectory via MSE and JaSMin losses. Empirically, ODE-ViT achieves competitive CIFAR-10/100 performance with up to an order of magnitude fewer parameters and is robust to the number of integration steps; the teacher–student setup improves performance by more than 10% over an ODE-ViT trained from scratch. The work provides a theoretical and practical foundation for continuous-depth ViTs, opening avenues for higher-order solvers, adaptive step sizes, and multi-modal extensions.
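To make the continuous-depth reading concrete, the following is a minimal sketch, not the authors' implementation. It integrates an autonomous (time-independent, weight-shared) vector field with a Lie–Trotter splitting, approximating each subflow by one explicit Euler step. The choice of the two subflows (single-head self-attention and a small feed-forward map), the form of `center`, and all function names are assumptions made for illustration; the paper's actual split and normalization may differ.

```python
import math

def center(tokens):
    # Assumed form of center normalization: subtract each token's feature mean.
    return [[v - sum(t) / len(t) for v in t] for t in tokens]

def matmul(a, b):
    # Plain-Python matrix product of a (n x d) and b (d x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_field(tokens, wq, wk, wv):
    # Single-head softmax self-attention, used as a time-independent vector field.
    z = center(tokens)
    q, k, v = matmul(z, wq), matmul(z, wk), matmul(z, wv)
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        m = max(scores)  # subtract the max for a numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wj / total * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

def mlp_field(tokens, w1, w2):
    # Hypothetical second subflow: a two-layer feed-forward map with tanh.
    hidden = [[math.tanh(h) for h in row] for row in matmul(center(tokens), w1)]
    return matmul(hidden, w2)

def euler_substep(tokens, field, h):
    # One explicit Euler step x <- x + h * f(x) for a single subflow.
    return [[x + h * dx for x, dx in zip(t, d)]
            for t, d in zip(tokens, field(tokens))]

def lie_trotter_solve(x0, attn_w, mlp_w, n_steps, t_end=1.0):
    # Lie-Trotter splitting: advance the attention subflow, then the MLP subflow,
    # reusing the same weights at every step (autonomous dynamics).
    h = t_end / n_steps
    x, trajectory = x0, [x0]
    for _ in range(n_steps):
        x = euler_substep(x, lambda s: attention_field(s, *attn_w), h)
        x = euler_substep(x, lambda s: mlp_field(s, *mlp_w), h)
        trajectory.append(x)
    return trajectory
```

With a step size of 1, each loop iteration has the same form as the two residual updates of a standard Transformer block, which is the link between discrete and continuous depth that this sketch is meant to show. Because the weights are shared across steps, changing `n_steps` changes the compute but not the parameter count.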
Abstract
In recent years, increasingly large models have achieved outstanding performance across computer vision (CV) tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how they make decisions. Most of these architectures rely on the attention mechanism within Transformer-based designs. Building upon the connection between residual neural networks and ordinary differential equations (ODEs), we introduce ODE-ViT, a Vision Transformer reformulated as an ODE system that satisfies the conditions for well-posed and stable dynamics. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ODE-ViT achieves stable, interpretable, and competitive performance with up to one order of magnitude fewer parameters, surpassing prior ODE-based Transformer approaches in classification tasks. We further propose a plug-and-play teacher-student framework in which a discrete ViT guides the continuous trajectory of ODE-ViT by treating the intermediate representations of the teacher as solutions of the ODE. This strategy improves performance by more than 10% compared to training a free ODE-ViT from scratch.
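As an illustration of treating the teacher's intermediate representations as solutions of the ODE, the sketch below computes only an MSE term between a student trajectory and the teacher's per-layer states. It is a hedged example under stated assumptions, not the paper's training objective: the function name `trajectory_mse`, the assignment of teacher layer k to time k/L, and the requirement that the student step count be a multiple of the teacher depth L are all hypothetical, and the JaSMin loss is omitted.

```python
def trajectory_mse(student_traj, teacher_states):
    # student_traj: list of n_steps + 1 states (initial state included).
    # teacher_states: list of L + 1 states (input embedding plus each layer output).
    # Each state is a list of tokens; each token is a list of floats.
    n_layers = len(teacher_states) - 1
    n_steps = len(student_traj) - 1
    if n_layers <= 0 or n_steps % n_layers != 0:
        raise ValueError("student steps must be a positive multiple of teacher depth")
    stride = n_steps // n_layers  # student steps per teacher layer (assumption)
    total, count = 0.0, 0
    for k, target in enumerate(teacher_states):
        state = student_traj[k * stride]  # student state at time k / L
        for s_tok, t_tok in zip(state, target):
            for s, t in zip(s_tok, t_tok):
                total += (s - t) ** 2
                count += 1
    return total / count

# Toy usage: a 2-layer teacher and a 4-step student with one scalar token.
teacher = [[[0.0]], [[1.0]], [[2.0]]]
student = [[[0.0]], [[0.5]], [[1.0]], [[1.5]], [[3.0]]]
loss = trajectory_mse(student, teacher)  # compares student steps 0, 2, 4
```

Matching whole trajectories rather than only the final output is what lets a pretrained discrete ViT act as guidance at every depth; how the matched time points are chosen in practice is a design decision this sketch does not settle.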
