Fully Attentional Networks with Self-emerging Token Labeling

Bingyin Zhao, Zhiding Yu, Shiyi Lan, Yutao Cheng, Anima Anandkumar, Yingjie Lao, Jose M. Alvarez

TL;DR

This paper introduces Self-Emerging Token Labeling (STL), a two-stage pre-training framework for Fully Attentional Networks (FAN) that uses self-produced patch-level token labels. A FAN token labeler (FAN-TL) is first trained to generate semantically meaningful token labels; a FAN student is then trained on those token labels together with the image-level class label, with Gumbel-Softmax token selection focusing supervision on high-confidence patches. STL improves robustness to out-of-distribution data (ImageNet-A, ImageNet-R, ImageNet-C) and transfer to semantic segmentation and object detection, setting a new state of the art on ImageNet-A (46.1%) and ImageNet-R (56.6%) with a 77.3M-parameter model. The results indicate that knowledge produced by a ViT itself can supervise dense representation learning, which may inform pre-training strategies for other vision models.

Abstract

Recent studies indicate that Vision Transformers (ViTs) are robust against out-of-distribution scenarios. In particular, the Fully Attentional Network (FAN), a family of ViT backbones, has achieved state-of-the-art robustness. In this paper, we revisit the FAN models and improve their pre-training with a self-emerging token labeling (STL) framework. Our method consists of two training stages. Specifically, we first train a FAN token labeler (FAN-TL) to generate semantically meaningful patch token labels, and then train a FAN student model using both the token labels and the original class label. With the proposed STL framework, our best model based on FAN-L-Hybrid (77.3M parameters) achieves 84.8% Top-1 accuracy on ImageNet-1K and 42.1% mCE on ImageNet-C, and sets a new state-of-the-art for ImageNet-A (46.1%) and ImageNet-R (56.6%) without using extra data, outperforming the original FAN counterpart by significant margins. The proposed framework also demonstrates significantly enhanced performance on downstream tasks such as semantic segmentation, with up to 1.7% improvement in robustness over the counterpart model. Code is available at https://github.com/NVlabs/STL.
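
To make the Stage 2 objective concrete, the following is a minimal sketch of a joint loss that combines a class-label term with a token-label term. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `keep_mask` argument, the averaging over kept tokens, and the `token_weight` coefficient are all hypothetical, and the abstract does not specify the exact form of the loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_probs):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def joint_loss(cls_logits, class_label, token_logits, token_labels,
               keep_mask, token_weight=1.0):
    """Class-label loss plus a weighted token-label loss (hypothetical form).

    cls_logits:   student logits for the whole image
    class_label:  integer image-level label
    token_logits: per-patch student logits
    token_labels: per-patch target distributions from the token labeler
    keep_mask:    per-patch booleans; only kept patches are supervised
    token_weight: assumed balancing coefficient, not taken from the paper
    """
    one_hot = [1.0 if i == class_label else 0.0 for i in range(len(cls_logits))]
    cls_term = cross_entropy(cls_logits, one_hot)

    kept = [cross_entropy(logits, label)
            for logits, label, keep in zip(token_logits, token_labels, keep_mask)
            if keep]
    # With no kept patches the token term vanishes and only the class term remains.
    token_term = sum(kept) / len(kept) if kept else 0.0
    return cls_term + token_weight * token_term
```

In this sketch, patches excluded by `keep_mask` contribute nothing to the loss, which mirrors the idea of supervising only the patches whose token labels are trusted.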

Paper Structure (28 sections, 5 equations, 5 figures, 13 tables)

Figures (5)

  • Figure 1: Zero-shot robustness results on ImageNet-A and ImageNet-R. Models trained on ImageNet-1K with self-emerging token labels from FAN show superior robustness to out-of-distribution data. Our best model (with only 77.3M parameters) achieves robust accuracies of 46.1% and 56.6%, setting new records on ImageNet-A and ImageNet-R, respectively.
  • Figure 2: Illustration of token labels generated by FAN-TL and the token label confidence score distribution. (a) Original image (class: "tench"). (b) Binary color map of token labels (yellow: tokens classified as "tench"; dark blue: tokens not classified as "tench"). (c) Ternary color map of token labels (cyan: foreground tokens with low confidence; yellow: foreground tokens with high confidence). (d) Binary color map of foreground tokens selected by Gumbel-Softmax. (e) Token label confidence score distribution for a batch of 16 images.
  • Figure 3: Illustration of Stage 2: training student models with self-emerging token labels. During training, token labels are generated by FAN-TL and assigned to the patch tokens of the student model. The token labels and class labels are used jointly to train the student. FAN-TL can identify incorrect token labels on its own from their confidence scores. Tokens with high confidence scores give a more accurate segmentation of objects and are crucial to the robustness improvement. By applying spatial-only data augmentation to the inputs and Gumbel-Softmax to the token outputs of FAN-TL, we obtain the most accurate and critical token labels.
  • Figure 4: Visualization results of token labels generated by FAN-TL with full data augmentations. Strong augmentations significantly affect the quality of token labels.
  • Figure 5: More visualization results of token labels generated by FAN-TL. FAN-TL performs consistently well in capturing the object gestalt and generating accurate token labels for images with spatial-only data augmentations.
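
The captions for Figures 2 and 3 refer to selecting foreground tokens with Gumbel-Softmax. The sketch below shows generic Gumbel-Softmax sampling in plain Python, followed by one possible selection rule. The sampling formula is the standard one; the selection rule (keep a token when its hard sample equals the image class) and the names `select_tokens`, `image_class`, and `rng` are assumptions made for illustration, since the captions do not state the exact rule used by the authors.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw one relaxed sample softmax((logits + g) / temperature), g ~ Gumbel(0, 1)."""
    noisy = []
    for x in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((x + g) / temperature)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(token_logits, image_class, temperature=1.0, rng=random):
    """Return a boolean mask over tokens (hypothetical selection rule).

    A token is kept when the argmax of its Gumbel-Softmax sample equals the
    image-level class. Confident tokens are selected almost always, while
    low-confidence tokens are selected only some of the time.
    """
    mask = []
    for logits in token_logits:
        sample = gumbel_softmax(logits, temperature, rng)
        hard = max(range(len(sample)), key=sample.__getitem__)
        mask.append(hard == image_class)
    return mask


# Usage: two confident tokens (one foreground, one background) and one ambiguous token.
mask = select_tokens([[50.0, -50.0], [-50.0, 50.0], [0.1, 0.0]],
                     image_class=0, rng=random.Random(0))
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible; the result for the ambiguous third token depends on the seed, whereas the two confident tokens are decided by their logits.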