SeiT++: Masked Token Modeling Improves Storage-efficient Training

Minhyun Lee; Song Park; Byeongho Heo; Dongyoon Han; Hyunjung Shim

TL;DR

SeiT++ tackles the storage bottleneck in vision model training by learning from offline token representations and coupling Masked Token Modeling with novel token augmentations. It introduces TokenAdapt and ColorAdapt to safely apply augmentation in the token domain, enabling effective self-supervised pre-training and improving robustness and generalization. Empirical results across storage-efficient ImageNet-1k, fine-grained classification, ADE-20k segmentation, and robustness benchmarks show consistent gains over SeiT, with MTM providing additional throughput in low-storage regimes. The approach demonstrates strong practical impact for scaling vision models with limited storage and highlights broad applicability to alternative tokenizations and input formats.

Abstract

Recent advancements in Deep Neural Network (DNN) models have significantly improved performance across computer vision tasks. However, achieving highly generalizable and high-performing vision models requires expansive datasets, resulting in significant storage requirements. This storage challenge is a critical bottleneck for scaling up models. SeiT recently proposed using Vector-Quantized (VQ) feature vectors (i.e., tokens) as network inputs for vision classification. This approach achieved 90% of the performance of a model trained on full-pixel images with only 1% of the storage. Because SeiT needs labeled data, its potential in scenarios beyond fully supervised learning remains largely untapped. In this paper, we extend SeiT by integrating Masked Token Modeling (MTM) for self-supervised pre-training. Recognizing that self-supervised approaches often demand more data due to the lack of labels, we introduce TokenAdapt and ColorAdapt. These methods enable token-friendly data augmentation, addressing the increased data requirements of self-supervised learning. We evaluate our approach across various scenarios, including storage-efficient ImageNet-1k classification, fine-grained classification, ADE-20k semantic segmentation, and robustness benchmarks. Experimental results show consistent performance improvements across these settings, validating the effectiveness of our method. Code is available at https://github.com/naver-ai/seit.
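
To make the masking step of Masked Token Modeling concrete, the following is a minimal sketch of random masking over one image's pre-stored token indices. It is not the authors' implementation: the sentinel `MASK_ID`, the function name `random_mask_tokens`, and the `mask_ratio` values are assumptions chosen for illustration, and the actual masking strategy and ratio used by SeiT++ are not stated in this excerpt.

```python
import random

# Hypothetical sentinel marking a masked position (not from the paper).
MASK_ID = -1


def random_mask_tokens(tokens, mask_ratio=0.5, seed=None):
    """Replace a random subset of token indices with MASK_ID.

    tokens: list of integer VQ codebook indices for one image (flattened grid).
    Returns (masked_tokens, mask), where mask[i] is True at masked positions.
    """
    rng = random.Random(seed)
    num_masked = round(mask_ratio * len(tokens))
    # Sample distinct positions to mask.
    masked_positions = set(rng.sample(range(len(tokens)), num_masked))
    mask = [i in masked_positions for i in range(len(tokens))]
    masked_tokens = [MASK_ID if m else t for t, m in zip(tokens, mask)]
    return masked_tokens, mask


# Toy example: a 4x4 token grid, flattened to 16 indices.
tokens = list(range(16))
masked_tokens, mask = random_mask_tokens(tokens, mask_ratio=0.75, seed=0)
```

A model would then be asked to predict the original tokens from `masked_tokens`. In MAE-style setups the loss is typically restricted to positions where `mask` is True; whether SeiT++ does exactly this is not confirmed by the text above.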

Paper Structure (41 sections, 4 equations, 9 figures, 10 tables)

Introduction
Related Work
Learning with Tokenization.
Storage-efficient Vision Training.
Data Augmentation.
Method
Preliminary: Storage-efficient Vision Training (SeiT)
SeiT
Preparing the Token Dataset.
Training Pipeline.
Masked Token Modeling
Masking.
MTM Encoder and Decoder.
Training Objective.
Data Augmentation for Tokens
...and 26 more sections

Figures (9)

Figure 1: Over 70% top-1 accuracy is achievable with just 1GB data. On ImageNet-1k, we visualize the trade-off between training data storage and top-1 accuracy, using a fixed ViT-B/16 for a controlled comparison. Each reported accuracy comes from a model trained separately on a different data type. For reference, training on the full ImageNet-1k images requires approximately 140GB; our approach is considerably more storage-efficient than competing methods.
Figure 2: Masked Token Modeling (MTM) pipeline. MTM is a self-supervised learning approach in token-based frameworks. (a) The tokenized dataset is saved on storage before model training. Then, (b) using only the pre-stored tokens, a storage-efficient vision model (MTM) is trained without relying on labeled datasets.
Figure 3: Data augmentations with tokens. Each ViT-VQGAN decoded image is reconstructed from a given RGB image after undergoing specific data augmentations. We observe that naively adopting these methods results in incorrect tokenization: 1) Token w/ hFlip demonstrates spatial information collapse during tokenization; 2) Token w/ RRC shows interdependence between neighboring token embeddings. The reconstructed images fail to preserve the details of the original images, suggesting that the tokenization becomes ineffective. Furthermore, we report top-1 accuracies (ViT-B) on ImageNet-1k according to the augmentations applied during training. All the augmentations are de-facto default training setups for vision transformers [deit, augreg, swin].
Figure 4: TokenAdapt processing pipeline. TokenAdapt aims to enhance the compatibility of token embeddings with pixel-based data augmentations by converting them into augmentation-compatible space, applying augmentations, and reverting them back to the original token embedding space.
Figure 5: ColorAdapt provides more reasonable color changes. We present ViT-VQGAN decoded images to verify the quality of tokenizations after color changes. We use the brightness function with a factor of 0.2, following the implementation in [kornia]. Emb-Noise [seit] is the color-based token augmentation. Notably, our ColorAdapt effectively preserves object structure in contrast to the failure of the counterparts.
...and 4 more figures
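
The three-stage flow described in the Figure 4 caption (convert, augment, revert) can be sketched as a simple composition. This is a structural illustration only, under stated assumptions: the names `token_adapt`, `to_aug_space`, and `from_aug_space` are hypothetical, the conversions are identity placeholders standing in for whatever conversion TokenAdapt actually uses, and scalar grid entries stand in for token embedding vectors.

```python
def hflip(grid):
    """Horizontally flip a 2-D grid given as a list of rows."""
    return [list(reversed(row)) for row in grid]


def token_adapt(embeddings, augment, to_aug_space, from_aug_space):
    """Convert -> augment -> revert, following the Figure 4 caption.

    to_aug_space / from_aug_space are hypothetical callables mapping the
    token embedding grid to and from an augmentation-compatible space.
    """
    converted = to_aug_space(embeddings)
    augmented = augment(converted)
    return from_aug_space(augmented)


def identity(x):
    """Placeholder conversion (assumption): leaves the grid unchanged."""
    return x


# Toy 2x3 grid; each scalar stands in for one token embedding.
grid = [[1, 2, 3], [4, 5, 6]]
flipped = token_adapt(grid, hflip, identity, identity)
```

With identity placeholders this reduces to flipping the token grid directly, which is the naive approach that Figure 3 reports as problematic. The point of the sketch is only where the two conversion steps sit relative to the pixel-style augmentation.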
