Scaling White-Box Transformers for Vision

Jinrui Yang, Xianhang Li, Druv Pai, Yuyin Zhou, Yi Ma, Yaodong Yu, Cihang Xie

TL;DR

This paper demonstrates the scalable training of CRATE-α, a white-box vision transformer built on the CRATE framework, by introducing an overcomplete and decoupled dictionary-based sparse coding block and a residual connection. The combined architectural edits and a light training recipe enable substantial scaling from Base to Huge, achieving 85.1% top-1 on ImageNet-1K with IN-21K pretraining and 72.3% zero-shot accuracy with DataComp1B, while preserving, and often improving, semantic interpretability such as zero-shot segmentation. The work provides a principled path to scaling mathematically interpretable models, offering competitive performance with ViTs under comparable compute and enabling broader applications in vision-language pretraining and downstream tasks. Overall, CRATE-α advances scalable, interpretable deep nets by balancing compression, sparsity, and expressive capacity through unrolled optimization grounded in Sparse Rate Reduction.

Abstract

CRATE, a white-box transformer architecture designed to learn compressed and sparse representations, offers an intriguing alternative to standard vision transformers (ViTs) due to its inherent mathematical interpretability. Despite extensive investigations into the scaling behaviors of language and vision transformers, the scalability of CRATE remains an open question, which this paper aims to address. Specifically, we propose CRATE-$\alpha$, featuring strategic yet minimal modifications to the sparse coding block in the CRATE architecture design, and a light training recipe designed to improve the scalability of CRATE. Through extensive experiments, we demonstrate that CRATE-$\alpha$ can effectively scale with larger model sizes and datasets. For example, our CRATE-$\alpha$-B substantially outperforms the prior best CRATE-B model accuracy on ImageNet classification by 3.7%, achieving an accuracy of 83.2%. Meanwhile, when scaling further, our CRATE-$\alpha$-L obtains an ImageNet classification accuracy of 85.1%. More notably, these model performance improvements are achieved while preserving, and potentially even enhancing, the interpretability of learned CRATE models: we show that the learned token representations of increasingly larger trained CRATE-$\alpha$ models yield increasingly higher-quality unsupervised object segmentation of images. The project page is https://rayjryang.github.io/CRATE-alpha/.

Paper Structure (16 sections, 12 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: (Left) We demonstrate how modifications to the components enhance the performance of the CRATE model. The four models are trained using the same setup: first pre-trained on ImageNet-21K and then fine-tuned on ImageNet-1K. Details are provided in Section \ref{sec:crate-alpha}. (Right) We compare the FLOPs and accuracy on ImageNet-1K of our methods with ViT \cite{dosovitskiy2020image} and CRATE \cite{yu2023white}. The values of the CRATE-$\alpha$ model correspond to those presented in Table \ref{tab:ours_model_result_on1k}. A more detailed comparison between CRATE-$\alpha$ and ViT is included in Appendix \ref{sec:comparison_vit}.
  • Figure 2: One layer of the CRATE-$\alpha$ model architecture. $\operatorname{\texttt{MSSA}}$ (Multi-head Subspace Self-Attention, defined in \ref{eq:mssa}) represents the compression block, and ODL (Overcomplete Dictionary Learning, defined in \ref{eq:odl}) represents the sparse coding block. A more detailed illustration of the modifications is provided in Fig. \ref{fig:exp-rc-sparisty-small-new} in the Appendix.
  • Figure 3: Training loss curves of CRATE-$\alpha$ on ImageNet-21K. (Left) Comparing training loss curves across CRATE-$\alpha$ with different model sizes. (Right) Comparing training loss curves across CRATE-$\alpha$-Large with different patch sizes.
  • Figure 4: (Left) Comparing training loss curves of CRATE-$\alpha$-CLIPA with different model sizes on DataComp1B. (Right) Comparing zero-shot accuracy of CRATE-$\alpha$-B/L/H models and ViT-H on ImageNet-1K.
  • Figure 5: Visualization of segmentation on COCO val2017 \cite{lin2014microsoft} with MaskCut \cite{wang2023cut}. (Top row) Supervised CRATE-$\alpha$ effectively identifies the main objects in the image. Compared with CRATE (Middle row), CRATE-$\alpha$ achieves better segmentation performance in terms of boundary. (Bottom row) Supervised ViT fails to identify the main objects in most images. We mark failed image with .
  • ...and 5 more figures
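
The sparse coding block named in the Figure 2 caption (ODL), described in the TL;DR as overcomplete, decoupled, and followed by a residual connection, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `odl_block`, the step size `eta`, the sparsity weight `lam`, the number of iterations, and the dictionary sizes are all assumptions chosen for illustration. The sketch uses a few non-negative ISTA (proximal gradient) steps with an encoding dictionary `D`, then decodes with a separate dictionary `D_hat` and adds the input back.

```python
import numpy as np


def odl_block(Z, D, D_hat, eta=0.1, lam=0.1, steps=2):
    """Hypothetical sketch of an overcomplete, decoupled sparse coding block.

    Z:     (d, n) token representations, one token per column.
    D:     (d, m) encoding dictionary; overcomplete when m > d.
    D_hat: (d, m) separate ("decoupled") decoding dictionary.
    All hyperparameter defaults are illustrative assumptions.
    """
    # Sparse codes start at zero.
    A = np.zeros((D.shape[1], Z.shape[1]))
    for _ in range(steps):
        # Gradient of 0.5 * ||Z - D A||_F^2 with respect to A.
        grad = D.T @ (D @ A - Z)
        # Proximal step for lam * ||A||_1 under the constraint A >= 0,
        # which reduces to a shifted ReLU.
        A = np.maximum(A - eta * grad - eta * lam, 0.0)
    # Decode with the decoupled dictionary and add the residual connection.
    return Z + D_hat @ A, A


# Small usage example with random data (sizes are arbitrary).
rng = np.random.default_rng(0)
d, n, m = 8, 5, 32  # m = 4 * d gives an overcomplete dictionary
Z = rng.standard_normal((d, n))
D = rng.standard_normal((d, m)) / np.sqrt(d)
D_hat = rng.standard_normal((d, m)) / np.sqrt(d)
out, codes = odl_block(Z, D, D_hat)
```

With a very large `lam`, every code is thresholded to zero and the block reduces to the identity through the residual path, which is one quick sanity check of the sketch.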