LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order
Matthias Freiberger, Peter Kun, Anders Sundnes Løvlie, Sebastian Risi
TL;DR
This work tackles the robustness of vision transformers to layer pruning and arbitrary execution orders in distributed inference settings. It introduces LayerShuffle, which randomly permutes the stack of attention and feed-forward modules during training, with optional layer-position encodings and per-layer position predictors to enhance adaptability. Experiments on ImageNet2012 and CIFAR-100 show that while sequential performance can suffer modestly, LayerShuffle-trained models maintain meaningful accuracy under arbitrary layer orders and degrade gracefully under pruning, outperforming baselines in many arbitrary-order cases. Analyses of intermediate representations reveal that layers adapt their outputs based on their current position and input distribution, supporting the practical potential for distributed, fault-tolerant, and energy-efficient inference across networks of devices.
Abstract
Due to their architecture and how they are trained, artificial neural networks are typically not robust to pruning or shuffling of layers at test time. However, such properties would be desirable for different applications, such as distributed neural network architectures where the order of execution cannot be guaranteed or parts of the network can fail during inference. In this work, we address these issues through a number of training approaches for vision transformers whose most important component is randomizing the execution order of attention modules at training time. With our proposed approaches, vision transformers can adapt to arbitrary layer execution orders at test time, assuming one tolerates a reduction (about 20%) in accuracy at the same model size. We analyse the feature representations of our trained models as well as how each layer contributes to the model's prediction based on its position during inference. Our analysis shows that layers learn to contribute differently based on their position in the network. Finally, we layer-prune our models at test time and find that their performance declines gracefully. Code is available at https://github.com/matfrei/layershuffle.
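To make the core idea concrete, the following is a minimal, framework-free sketch of sampling a random layer execution order per training step and of dropping layers at test time. It is not the authors' implementation (see the repository above for that): the toy layers and the names `make_toy_layer`, `forward`, `sample_training_order`, and `prune_order` are hypothetical and stand in for real attention and feed-forward blocks.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a transformer block with a residual connection."""
    def layer(x: Vector) -> Vector:
        # Residual update: x + f(x), with f a simple nonlinearity.
        return [v + scale * math.tanh(v) for v in x]
    return layer


def forward(layers: Sequence[Layer], x: Vector,
            order: Optional[Sequence[int]] = None) -> Vector:
    """Apply the layers in the given order (sequential order if none is given)."""
    if order is None:
        order = range(len(layers))
    for idx in order:
        x = layers[idx](x)
    return x


def sample_training_order(num_layers: int, rng: random.Random) -> List[int]:
    """Draw a fresh random permutation of layer indices for one training step."""
    order = list(range(num_layers))
    rng.shuffle(order)
    return order


def prune_order(order: Sequence[int], keep: int, rng: random.Random) -> List[int]:
    """Keep a random subset of `keep` layers, preserving their relative order."""
    kept_positions = sorted(rng.sample(range(len(order)), keep))
    return [order[p] for p in kept_positions]


if __name__ == "__main__":
    rng = random.Random(0)
    layers = [make_toy_layer(0.1 * (i + 1)) for i in range(4)]
    x = [1.0, -2.0]
    order = sample_training_order(len(layers), rng)
    print("shuffled order:", order, "->", forward(layers, x, order))
    pruned = prune_order(order, keep=2, rng=rng)
    print("pruned order:  ", pruned, "->", forward(layers, x, pruned))
```

In an actual training loop, a new order would be sampled for every batch so that no layer can rely on a fixed position; the residual form of each block is what keeps the input and output spaces of the layers compatible under reordering and removal.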
