Masked Completion via Structured Diffusion with White-Box Transformers

Druv Pai, Ziyang Wu, Sam Buchanan, Yaodong Yu, Yi Ma

TL;DR

This work tackles unsupervised representation learning with interpretable, structured models by introducing CRATE-MAE, a white-box transformer-like autoencoder derived from unrolling a sparse rate-reduction objective and linking compression with diffusion-inspired denoising. It develops a distributional signal model for token representations, constructs distributionally-invertible encoder and decoder layers based on MSSA and ISTA blocks, and frames a time-reversed diffusion process to enable deterministic autoencoding. Empirically, CRATE-MAE achieves competitive performance on large-scale imagery with roughly 30% of the parameters of standard masked autoencoders and reveals semantically meaningful, linearly structured representations and interpretable attention maps. The approach bridges diffusion, rate reduction, and transformer design, showing that principled white-box architectures can be effective for unsupervised learning and scalable vision tasks.

Abstract

Modern learning frameworks often train deep neural networks with massive amounts of unlabeled data to learn representations by solving simple pretext tasks, then use the representations as foundations for downstream tasks. These networks are empirically designed; as such, they are usually not interpretable, their representations are not structured, and their designs are potentially redundant. White-box deep networks, in which each layer explicitly identifies and transforms structures in the data, present a promising alternative. However, existing white-box architectures have only been shown to work at scale in supervised settings with labeled data, such as classification. In this work, we provide the first instantiation of the white-box design paradigm that can be applied to large-scale unsupervised representation learning. We do this by exploiting a fundamental connection between diffusion, compression, and (masked) completion, deriving a deep transformer-like masked autoencoder architecture, called CRATE-MAE, in which the role of each layer is mathematically fully interpretable: they transform the data distribution to and from a structured representation. Extensive empirical evaluations confirm our analytical insights. CRATE-MAE demonstrates highly promising performance on large-scale imagery datasets while using only ~30% of the parameters compared to the standard masked autoencoder with the same model configuration. The representations learned by CRATE-MAE have explicit structure and also contain semantic meaning. Code is available at https://github.com/Ma-Lab-Berkeley/CRATE .

Paper Structure

This paper contains 36 sections, 10 theorems, 166 equations, 11 figures, 6 tables.

Key Result

Theorem 1

Suppose $\bm{Z}$ follows the noisy Gaussian codebook model \ref{model:gaussian_tokens_noise}, with infinitesimal noise level $\sigma^{\ell} > 0$ and subspace memberships $s_{i}$ distributed as i.i.d. categorical random variables on the set of subspace indices $\{1, \dots, K\}$, independently of all other ...

Figures (11)

  • Figure 1: Diagram of the overall white-box CRATE-MAE pipeline, illustrating the end-to-end (masked) autoencoding process. Each encoder layer $f^{\ell}$ iteratively transforms the token representations towards a parsimonious (e.g., compressed and sparse) representation. The decoder layers $g^{\ell}$ then transform these representations back to the original image. Each encoder layer $f^{\ell}$ is meant to be (partially) inverted by a corresponding decoder layer $g^{L - \ell}$.
  • Figure 2: The compression-sparsification iteration implemented by each layer of CRATE, and each encoder layer of CRATE-MAE. The compression step, implemented by the $\operatorname{\texttt{MSSA}}$ operator, projects the tokens $\bm{Z}^{\ell}$ towards the subspace model $\bm{U}_{[K]}^{\ell}$ to form $\bm{Z}^{\ell + 1/2}$. The sparsification step, implemented by the $\operatorname{\texttt{ISTA}}$ operator, rotates the tokens in $\bm{Z}^{\ell + 1/2}$ towards the coordinate axes, using the sparsifying dictionary $\bm{D}^{\ell}$, to get $\bm{Z}^{\ell + 1}$. The two steps are performed in sequence and together form a single layer of the CRATE-MAE encoder.
  • Figure 3: Compression and denoising against the low-dimensional Gaussian mixture token model \ref{model:gaussian_tokens} are equivalent. Left: the effect of compression against the low-dimensional Gaussian mixture model for tokens \ref{model:gaussian_tokens}, i.e., taking gradient steps on the coding rate $R^{c}(\cdot \mid \bm{U}_{[K]})$ (or equivalently, using the $\operatorname{\texttt{MSSA}}(\cdot \mid \bm{U}_{[K]})$ operator), which is shown in \ref{thm:informal_rate_score} to be equivalent to projecting onto the subspaces $\bm{U}_{[K]}$. Right: the effect of denoising against \ref{model:gaussian_tokens}, i.e., taking gradient steps on the score function of the noisy model \ref{model:gaussian_tokens_noise} at small noise levels $\sigma$, or equivalently small times $t$. Up to scaling factors (not pictured), these two operations are equivalent, and both can be interpreted geometrically and statistically as a projection onto the support of the data distribution. This connection motivates our structured denoising-diffusion framework, as elaborated in \ref{sub:unification}.
  • Figure 4: Diagram of each encoder layer (top) and decoder layer (bottom) in CRATE-MAE. Notice that the two layers are highly anti-parallel: each is constructed to perform the operations of the other in reverse order. That is, in the decoder layer $g^{\ell}$, the $\operatorname{\texttt{ISTA}}$ block of $f^{L - \ell}$ is partially inverted first using a linear layer, then the $\operatorname{\texttt{MSSA}}$ block of $f^{L - \ell}$ is reversed; this order unravels the transformation done in $f^{L - \ell}$.
  • Figure 5: Left: The compression measure $R^{c}(\bm{Z}^{\ell+1/2} \mid \bm{U}_{[K]}^{\ell})$ at different layers of the encoder. Right: The sparsity measure $\|\bm{Z}^{\ell+1}\|_0 / (d\cdot N)$ at different layers of the encoder. Measurements were collected from CRATE-MAE-Base and averaged over 10,000 randomly chosen ImageNet samples. We observe that compression and sparsity improve consistently from layer to layer through the whole network.
  • ...and 6 more figures
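
The compression-sparsification iteration of Figure 2 and the two measurements of Figure 5 can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`mssa_step`, `ista_step`, `coding_rate`, `sparsity`) and the parameters `step`, `eta`, `lam`, and `eps` are hypothetical. The update rules follow the general form published for CRATE-style layers, with scaling constants folded into `step`, and the coding-rate constant $p / (N \epsilon^2)$ is likewise an assumption.

```python
import numpy as np

def softmax_columns(a):
    """Column-wise softmax, shifted by the column max for stability."""
    shifted = a - a.max(axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def mssa_step(Z, U_list, step=0.5):
    """Compression: move tokens Z (d x N) toward the subspaces in U_list.

    Assumed form: one self-attention head per subspace basis U_k (d x p),
    mixed with the input by a hypothetical step size `step`.
    """
    attended = np.zeros_like(Z)
    for U in U_list:
        V = U.T @ Z                                   # (p, N) coordinates in subspace k
        attended += U @ (V @ softmax_columns(V.T @ V))
    return (1.0 - step) * Z + step * attended

def ista_step(Z_half, D, eta=0.1, lam=0.1):
    """Sparsification: one non-negative ISTA step with dictionary D (d x d)."""
    residual = Z_half - D @ Z_half
    return np.maximum(Z_half + eta * (D.T @ residual) - eta * lam, 0.0)

def coding_rate(Z, U_list, eps=0.5):
    """Assumed compression measure: sum over k of 0.5 * logdet(I + c * V V^T)."""
    N = Z.shape[1]
    total = 0.0
    for U in U_list:
        V = U.T @ Z
        p = V.shape[0]
        gram = np.eye(p) + (p / (N * eps**2)) * (V @ V.T)
        total += 0.5 * np.linalg.slogdet(gram)[1]     # log|det|; gram is positive definite
    return total

def sparsity(Z):
    """Fraction of non-zero entries, i.e. ||Z||_0 / (d * N)."""
    return np.count_nonzero(Z) / Z.size

# Toy run with random orthonormal subspaces and a random orthogonal dictionary.
rng = np.random.default_rng(0)
d, N, K, p = 8, 16, 2, 3
Q, _ = np.linalg.qr(rng.standard_normal((d, K * p)))
U_list = [Q[:, k * p:(k + 1) * p] for k in range(K)]
D, _ = np.linalg.qr(rng.standard_normal((d, d)))
Z = rng.standard_normal((d, N))

Z_half = mssa_step(Z, U_list)   # plays the role of Z^{l+1/2}
Z_next = ista_step(Z_half, D)   # plays the role of Z^{l+1}
print("coding rate:", coding_rate(Z_half, U_list))
print("sparsity:", sparsity(Z_next))
```

Stacking several such layers and recording `coding_rate` and `sparsity` after each one would yield curves of the kind Figure 5 reports, although random bases like these will not reproduce the trained model's behavior.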

Theorems & Definitions (23)

  • Theorem 1: Informal version of \ref{lem:inverse-term} in \ref{app:computations_rr_gradient}
  • Theorem 3
  • Remark 4
  • Remark 5
  • Remark 6
  • proof: Proof of \ref{lem:inverse-term}
  • Lemma 7
  • proof
  • Lemma 8
  • proof
  • ...and 13 more