Multi-Attribute Vision Transformers are Efficient and Robust Learners

Hanan Gani, Nada Saadi, Noor Hussein, Karthik Nandakumar

TL;DR

The paper addresses multi-attribute learning in Vision Transformers by introducing MAL-ViT, which adds per-task attribute tokens that interact with patch tokens and share a single ViT backbone. It formalizes training with a weighted loss $L_{total}$, enabling per-attribute predictions from dedicated heads and shared representation learning. Empirical results on CelebA show MAL-ViT outperforms single-attribute ViTs and MAL-CNN, and demonstrates stronger robustness to adversarial perturbations including FGSM, BIM, PGD, UAP, and Patch-Fool attacks. These findings suggest attribute-token communication within ViTs yields both higher multi-task performance and greater resilience, with potential extensions to segmentation and detection tasks.
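The weighted loss $L_{total}$ mentioned above can be illustrated with a minimal sketch. It assumes, without confirmation from this summary, that $L_{total}$ is a weighted sum of per-attribute losses and that each attribute is a binary label trained with binary cross-entropy (CelebA attributes are binary). The function names, weights, and example values are hypothetical and are not taken from the authors' code.

```python
import math

def binary_cross_entropy(logit, target):
    """Numerically stable BCE for a single logit and a 0/1 target."""
    # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def total_loss(logits, targets, weights):
    """Assumed form: L_total = sum_i w_i * L_i, one term per attribute head."""
    if not (len(logits) == len(targets) == len(weights)):
        raise ValueError("one logit, target, and weight per attribute is required")
    return sum(
        w * binary_cross_entropy(z, y)
        for z, y, w in zip(logits, targets, weights)
    )

# Hypothetical example: three binary attributes, one logit per attribute head.
logits = [2.0, -1.0, 0.0]
targets = [1, 0, 1]
weights = [1.0, 1.0, 0.5]  # illustrative per-attribute weights
loss = total_loss(logits, targets, weights)
print(f"L_total = {loss:.4f}")
```

Under this assumed form, setting all weights equal reduces the loss to a plain sum over attributes, while unequal weights let some attributes contribute more to the shared backbone's gradient.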

Abstract

Since their inception, Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs) across a wide spectrum of tasks. ViTs exhibit notable characteristics, including global attention, resilience against occlusions, and adaptability to distribution shifts. One underexplored aspect of ViTs is their potential for multi-attribute learning, referring to their ability to simultaneously grasp multiple attribute-related tasks. In this paper, we delve into the multi-attribute learning capability of ViTs, presenting a straightforward yet effective strategy for training various attributes through a single ViT network as distinct tasks. We assess the resilience of multi-attribute ViTs against adversarial attacks and compare their performance against ViTs designed for single attributes. Moreover, we further evaluate the robustness of multi-attribute ViTs against a recent transformer-based attack called Patch-Fool. Our empirical findings on the CelebA dataset provide validation for our assertion. Our code is available at https://github.com/hananshafi/MTL-ViT

Paper Structure (16 sections, 5 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Our Multi-Attribute ViT Framework: We introduce additional learnable attribute (task) tokens corresponding to each attribute and propagate them jointly with patch tokens inside the ViT. We take the output corresponding to each attribute from its respective token.
  • Figure 2: Effectiveness of task tokens. Our method MAL-ViT with task tokens (shown in orange in the plot) outperforms the one without tokens. Best viewed in zoom.
  • Figure 3: Each bar represents the mean robust accuracy of MAL-ViT (blue) and MAL-ViT (red) under FGSM, PGD, and BIM attacks.
  • Figure 4: MAL-ViT under UAP attack with different epsilon values
  • Figure 5: Mean robust accuracy vs. number of perturbed patch tokens under Patch-Fool attack on MAL-ViT.
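
Several of the captions above refer to gradient-based attacks bounded by an epsilon value. As a rough illustration of the simplest of these, FGSM, the sketch below perturbs an input by epsilon in the direction of the sign of the loss gradient. It uses a toy logistic-regression model in plain Python rather than a ViT, and the weights, input, and epsilon are hypothetical values chosen for illustration. It is not the evaluation code used in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(x, y, w, b):
    """BCE loss of a logistic model and its gradient with respect to the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    grad = [(p - y) * wi for wi in w]  # dL/dx for logistic regression
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, y, w, b, eps):
    """One-step FGSM: x_adv = x + eps * sign(dL/dx)."""
    _, grad = loss_and_input_grad(x, y, w, b)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (stand-ins for a network and an image).
w, b = [1.5, -2.0, 0.5], 0.1
x, y = [0.2, -0.4, 0.7], 1
eps = 0.1

x_adv = fgsm(x, y, w, b, eps)
clean_loss, _ = loss_and_input_grad(x, y, w, b)
adv_loss, _ = loss_and_input_grad(x_adv, y, w, b)
print(f"clean loss = {clean_loss:.4f}, adversarial loss = {adv_loss:.4f}")
```

BIM and PGD can be viewed as iterated versions of this step with a smaller step size and a projection back into the epsilon-ball. Image attacks usually also clip the result to the valid pixel range, which this toy example omits.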