AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer

Yitao Xu, Tong Zhang, Sabine Süsstrunk

TL;DR

Adaptor Neural Cellular Automata (AdaNCA) inserts NCA as plug-and-play adaptors between ViT layers, enhancing the ViT's performance and its robustness against adversarial samples as well as out-of-distribution inputs.

Abstract

Vision Transformers (ViTs) demonstrate remarkable performance in image classification through visual-token interaction learning, particularly when equipped with local information via region attention or convolutions. Although such architectures improve the feature aggregation from different granularities, they often fail to contribute to the robustness of the networks. Neural Cellular Automata (NCA) enables the modeling of global visual-token representations through local interactions, with its training strategies and architecture design conferring strong generalization ability and robustness against noisy input. In this paper, we propose Adaptor Neural Cellular Automata (AdaNCA) for Vision Transformers that uses NCA as plug-and-play adaptors between ViT layers, thus enhancing ViT's performance and robustness against adversarial samples as well as out-of-distribution inputs. To overcome the large computational overhead of standard NCAs, we propose Dynamic Interaction for more efficient interaction learning. Using our analysis of AdaNCA placement and robustness improvement, we also develop an algorithm for identifying the most effective insertion points for AdaNCA. With less than a 3% increase in parameters, AdaNCA contributes to more than 10% absolute improvement in accuracy under adversarial attacks on the ImageNet1K benchmark. Moreover, we demonstrate with extensive evaluations across eight robustness benchmarks and four ViT architectures that AdaNCA, as a plug-and-play module, consistently improves the robustness of ViTs.

Paper Structure (52 sections, 12 equations, 12 figures, 23 tables, 1 algorithm)

This paper contains 52 sections, 12 equations, 12 figures, 23 tables, 1 algorithm.

Figures (12)

  • Figure 1: Accuracy under adversarial attacks (APGD-DLR) versus corruption error on out-of-distribution input (ImageNet-C) for various ViT models. AdaNCA improves the robustness of different ViTs against both adversarial attacks and OOD input. $\star$: the same model architecture but with more layers.
  • Figure 2: Method overview. (a) To improve model performance and robustness, Neural Cellular Automata (NCA) can be inserted into Vision Transformers (ViTs) as adaptors, hence the name AdaNCA. The details of AdaNCA are presented in Section \ref{sec:enca-arch}. The improvement is maximized when AdaNCA is inserted between two layer sets that each consist of similar layers. (b) The robustness improvement brought by AdaNCA is highly correlated with the network redundancy quantification of the insert position, introduced in Section \ref{sec:insert-pos}. This supports the idea that AdaNCA should be placed between two sets of redundant layers.
  • Figure 3: Overview of the AdaNCA architecture. Instead of concatenating the interaction results generated by the depth-wise convolutions, our Dynamic Interaction computes a point-wise weighted sum of them, which improves efficiency and enhances performance. The weights are obtained from the token states, so that each token can dynamically adjust its interaction strategy according to the input. The Multi-scale Dynamic Interaction aggregates the results of Dynamic Interaction with convolutions of different dilation rates. Then, to finish one step of evolution, the output is fed into the Update stage.
  • Figure 4: Pair-wise layer similarities. Layer sets are marked with red boxes. Swin-B-AdaNCA has a clearer stage partition, which might be attributed to AdaNCA acting as an information transmitter between different layer sets.
  • Figure 5: Ablation on (a) the scales and (b) the number of kernels used in our Multi-scale Dynamic Interaction. Overly large scales can undermine performance, as can too many or too few kernels. We choose $\mathcal{S}=2$, $\mathcal{M}=4$ to balance clean accuracy and robustness.
  • ...and 7 more figures
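
The Dynamic Interaction described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the softmax normalization of the per-token weights, the linear projection `w_proj`, the sharing of kernels across scales, and the summation across dilation rates are all assumptions made for illustration; only the point-wise weighted sum over depth-wise convolution outputs, the token-dependent weights, and the use of several dilation rates come from the caption. The defaults of 4 kernels and 2 scales follow the values reported in the Figure 5 caption, read here as dilation rates 1 and 2 (also an assumption).

```python
import numpy as np

def depthwise_conv3x3(x, kernel, dilation=1):
    """Depth-wise 3x3 convolution with zero padding.
    x: (H, W, C) token grid; kernel: (3, 3, C), one 3x3 filter per channel."""
    H, W, _ = x.shape
    d = dilation
    padded = np.pad(x, ((d, d), (d, d), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            # Shifted view of the grid times the per-channel tap weight.
            out += padded[i * d:i * d + H, j * d:j * d + W, :] * kernel[i, j]
    return out

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_interaction(x, kernels, w_proj, dilation=1):
    """x: (H, W, C); kernels: (M, 3, 3, C); w_proj: (C, M), hypothetical projection."""
    # M interaction results, each of shape (H, W, C).
    results = np.stack(
        [depthwise_conv3x3(x, k, dilation) for k in kernels], axis=0
    )
    # Per-token weights over the M kernels, computed from the token states.
    # Softmax normalization is an assumption of this sketch.
    weights = softmax(x @ w_proj, axis=-1)  # (H, W, M)
    # Point-wise weighted sum instead of concatenation: output stays (H, W, C).
    return np.einsum("mhwc,hwm->hwc", results, weights)

def multi_scale_dynamic_interaction(x, kernels, w_proj, num_scales=2):
    """Aggregate Dynamic Interaction over dilation rates 1..num_scales.
    Summation and kernel sharing across scales are assumptions of this sketch."""
    return sum(
        dynamic_interaction(x, kernels, w_proj, dilation=s)
        for s in range(1, num_scales + 1)
    )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(14, 14, 32))       # 14x14 token grid, 32 channels
    kernels = rng.normal(size=(4, 3, 3, 32))     # M = 4 depth-wise kernels
    w_proj = rng.normal(size=(32, 4))
    out = multi_scale_dynamic_interaction(tokens, kernels, w_proj, num_scales=2)
    print(out.shape)
```

Because the M convolution outputs are combined by a weighted sum rather than concatenated, the channel dimension of the result equals that of the input, so whatever stage consumes it does not need to process an M-times wider tensor. This is one plausible reading of the efficiency gain that the caption attributes to Dynamic Interaction.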