LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order

Matthias Freiberger, Peter Kun, Anders Sundnes Løvlie, Sebastian Risi

TL;DR

This work tackles the robustness of vision transformers to layer pruning and arbitrary execution orders in distributed inference settings. It introduces LayerShuffle, which randomly permutes the stack of attention and feed-forward modules during training, with optional layer-position encodings and per-layer position predictors to enhance adaptability. Experiments on ImageNet2012 and CIFAR-100 show that while sequential performance can suffer modestly, LayerShuffle-trained models maintain meaningful accuracy under arbitrary layer orders and degrade gracefully under pruning, outperforming baselines in many arbitrary-order cases. Analyses of intermediate representations reveal that layers adapt their outputs based on their current position and input distribution, supporting the practical potential for distributed, fault-tolerant, and energy-efficient inference across networks of devices.

Abstract

Due to their architecture and how they are trained, artificial neural networks are typically not robust to pruning or shuffling of layers at test time. However, such properties would be desirable for several applications, such as distributed neural network architectures where the order of execution cannot be guaranteed or where parts of the network can fail during inference. In this work, we address these issues through a number of training approaches for vision transformers whose most important component is randomizing the execution order of attention modules at training time. With our proposed approaches, vision transformers are capable of adapting to arbitrary layer execution orders at test time, assuming one tolerates a reduction (about 20%) in accuracy at the same model size. We analyse the feature representations of our trained models as well as how each layer contributes to the model's prediction based on its position during inference. Our analysis shows that layers learn to contribute differently based on their position in the network. Finally, we layer-prune our models at test time and find that their performance declines gracefully. Code available at https://github.com/matfrei/layershuffle.
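
The core idea of randomizing the execution order at training time can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (see the repository linked above for that). The toy residual layer, the names `make_toy_layer` and `forward`, the layer count of 6, and the choice to draw a fresh permutation on every forward call are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(weight: float, bias: float) -> Layer:
    """Hypothetical stand-in for one transformer block (attention + feed-forward).

    Uses a residual update x + tanh(w * x + b), so input and output have the
    same shape and blocks can be composed in any order.
    """
    def layer(x: Vector) -> Vector:
        return [xi + math.tanh(weight * xi + bias) for xi in x]
    return layer


def forward(layers: Sequence[Layer], x: Vector, training: bool,
            rng: random.Random) -> Tuple[Vector, List[int]]:
    """Run every layer exactly once; permute the order only in training mode."""
    order = list(range(len(layers)))
    if training:
        rng.shuffle(order)  # assumption: a fresh permutation per forward call
    for idx in order:
        x = layers[idx](x)
    return x, order


rng = random.Random(0)
layers = [make_toy_layer(0.5 * (i + 1), 0.1 * i) for i in range(6)]
tokens = [0.2, -0.4, 1.0]

train_out, train_order = forward(layers, tokens, training=True, rng=rng)
eval_out, eval_order = forward(layers, tokens, training=False, rng=rng)
```

Because every block maps its input to an output of the same shape, any permutation is a valid execution order. In this sketch the evaluation pass keeps the original sequence; passing `training=True` at inference would emulate the arbitrary-order setting the abstract describes.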

Paper Structure (11 sections, 6 equations, 4 figures, 1 table, 1 algorithm)

This paper contains 11 sections, 6 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: LayerShuffle training results in robust vision transformers. (a) Illustration of the LayerShuffle approach. The execution order of attention modules is randomly permuted during training. (b) ImageNet2012 validation accuracy vs. number of pruned layers when executing layers in their original sequence. LayerShuffle performs similarly to LayerDrop (p=0.2), despite no layers being removed during training. (c) When additionally shuffling the layers at test time, all models fail except for LayerShuffle, whose performance degrades gracefully as more layers are removed.
  • Figure 2: Attention module with layer position encoding.
  • Figure 3: Attention module with layer position prediction.
  • Figure 4: UMAP-projected embeddings of layer outputs and layer contributions to the model prediction (estimated distribution of normalized L2 norms of the class token) for a model trained with shuffled execution order, with a baseline for comparison. In contrast to the baseline (a), a layer of a LayerShuffle-trained network (b) produces outputs in different subspaces of the latent space depending on its current position in the network. Darker colors indicate layer positions closer to the input; layer positions close to the output are shown in light colors. While layers in the baseline model overall contribute equally to the predictive output of the model, regardless of their current position in the network (c), the contribution of layers to the LayerShuffle-trained model's prediction (d) varies based on the distance to their original position in the network. Refinement of the model conditions its layers to only contribute to the overall predictive output if the received input lies within the layer's learned distribution of inputs.
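
The two test-time settings in Figure 1 (pruning with the original sequence, and pruning combined with shuffling) can be sketched as a small helper that decides which layers run and in what order. This is an illustrative sketch only: the name `execution_plan`, the depth of 12 layers, and the choice to pick the pruned layers uniformly at random are assumptions, since the text above does not state how pruned layers are selected.

```python
import random
from typing import List


def execution_plan(num_layers: int, num_pruned: int, shuffle: bool,
                   rng: random.Random) -> List[int]:
    """Return the indices of the layers to run, in the order to run them."""
    if not 0 <= num_pruned <= num_layers:
        raise ValueError("num_pruned must lie in [0, num_layers]")
    # Assumption: the pruned layers are chosen uniformly at random.
    kept = sorted(rng.sample(range(num_layers), num_layers - num_pruned))
    if shuffle:
        rng.shuffle(kept)  # arbitrary execution order, as in Figure 1 (c)
    return kept


rng = random.Random(0)
# Hypothetical 12-layer model with 3 layers removed at test time.
sequential_plan = execution_plan(12, 3, shuffle=False, rng=rng)  # Figure 1 (b) setting
shuffled_plan = execution_plan(12, 3, shuffle=True, rng=rng)     # Figure 1 (c) setting
```

Sweeping `num_pruned` from 0 up to the model depth and measuring validation accuracy for each plan would give curves of the kind shown in panels (b) and (c).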