Multi-Attribute Vision Transformers are Efficient and Robust Learners
Hanan Gani, Nada Saadi, Noor Hussein, Karthik Nandakumar
TL;DR
The paper addresses multi-attribute learning in Vision Transformers by introducing MAL-ViT, which adds per-task attribute tokens that interact with patch tokens and share a single ViT backbone. It formalizes training with a weighted loss $L_{total}$, enabling per-attribute predictions from dedicated heads alongside shared representation learning. Empirical results on CelebA show that MAL-ViT outperforms single-attribute ViTs and MAL-CNN, and that it is more robust to adversarial perturbations, including FGSM, BIM, PGD, UAP, and Patch-Fool attacks. These findings suggest that attribute-token communication within ViTs yields both higher multi-task performance and greater resilience, with potential extensions to segmentation and detection tasks.
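The weighted loss $L_{total}$ mentioned above can be illustrated with a minimal sketch. It assumes the common form $L_{total} = \sum_k w_k L_k$, where $L_k$ is a binary cross-entropy loss on the k-th attribute head; the summary does not state the per-attribute loss or the weighting scheme, so the function names, the choice of binary cross-entropy, and the example weights are all hypothetical.

```python
import math

def binary_cross_entropy(prob, label, eps=1e-12):
    """Binary cross-entropy for one predicted probability and a 0/1 label.

    Assumption: each attribute is a binary task, as with CelebA attributes.
    """
    # Clamp to avoid log(0) for saturated predictions.
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def total_loss(probs, labels, weights):
    """Weighted sum of per-attribute losses: L_total = sum_k w_k * L_k."""
    if not (len(probs) == len(labels) == len(weights)):
        raise ValueError("probs, labels, and weights must have equal length")
    return sum(
        w * binary_cross_entropy(p, y)
        for p, y, w in zip(probs, labels, weights)
    )

# Hypothetical outputs of three attribute heads for a single image.
probs = [0.9, 0.2, 0.6]    # predicted probability that each attribute is present
labels = [1, 0, 1]         # ground-truth attribute labels
weights = [1.0, 1.0, 0.5]  # illustrative per-attribute weights, not from the paper
loss = total_loss(probs, labels, weights)
```

In an actual training loop the probabilities would come from the dedicated heads attached to each attribute token, and the weights would be hyperparameters or learned quantities; this sketch only shows how the per-attribute terms combine into one scalar.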
Abstract
Since their inception, Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs) across a wide spectrum of tasks. ViTs exhibit notable characteristics, including global attention, resilience against occlusions, and adaptability to distribution shifts. One underexplored aspect of ViTs is their potential for multi-attribute learning, that is, their ability to learn multiple attribute-related tasks simultaneously. In this paper, we delve into the multi-attribute learning capability of ViTs, presenting a straightforward yet effective strategy for training various attributes through a single ViT network as distinct tasks. We assess the resilience of multi-attribute ViTs against adversarial attacks and compare their performance against ViTs designed for single attributes. We further evaluate the robustness of multi-attribute ViTs against a recent transformer-based attack called Patch-Fool. Our empirical findings on the CelebA dataset validate these claims. Our code is available at https://github.com/hananshafi/MTL-ViT
