DINO-Tok: Adapting DINO for Visual Tokenizers

Mingkai Jia, Mingxiao Li, Liaoyuan Fan, Tianxing Shi, Jiaxin Guo, Zeming Li, Xiaoyang Guo, Xiao-Xiao Long, Qian Zhang, Ping Tan, Wei Yin

TL;DR

DINO-Tok introduces a representation-driven visual tokenizer that combines shallow texture cues with deep semantic DINO features to form an information-complete latent space for both continuous (AE) and discrete (VQ) tokenization. A global PCA reweighting mechanism stabilizes high-dimensional vector quantization by emphasizing high-variance channels, and the VQ variant uses two specialized codebooks, one for semantics and one for texture. On ImageNet-256, DINO-Tok reports state-of-the-art reconstruction (AE: 28.54 PSNR; VQ: 23.98 PSNR) and strong zero-shot generalization, with effective diffusion-based generation when integrated into VA-VAE frameworks. The approach demonstrates that pretrained vision models can be repurposed as visual tokenizers, yielding semantically faithful, high-fidelity latent representations suitable for next-generation generative models.

Abstract

Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features that encode global semantics, DINO-Tok bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We therefore propose a global PCA reweighting mechanism to stabilize VQ and preserve essential information across dimensions. On ImageNet 256$\times$256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and performing comparably to models trained on billion-scale data (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models like DINO for tokenization yields semantically aligned, high-fidelity latent representations for next-generation visual generative models. Code will be publicly available at https://github.com/MKJia/DINO-Tok.
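
For readers unfamiliar with the metric, the reconstruction numbers above are peak signal-to-noise ratios, $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, where higher is better. The sketch below is a minimal illustration that assumes 8-bit pixel values (peak 255) stored as flat sequences; the helper name `psnr` is hypothetical, and the evaluation protocol actually used for the reported numbers (value range, averaging over images) is not specified in this summary.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value=255.0 assumes 8-bit images (an assumption of this sketch).
    """
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, halving the mean squared error raises PSNR by about 3 dB.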

Paper Structure

This paper contains 31 sections, 9 equations, 21 figures, 6 tables.

Figures (21)

  • Figure 1: How to Adapt DINO for Visual Tokenizers? (i) Visual results (top) show reconstruction details: distilling DINO features (VA-VAE, VFM-Tok) degrades semantics and limits reconstruction, while using a frozen DINO encoder introduces severe artifacts such as color shifts (see the red dashed box in DINO-Dec) and semantic replacement (see the red toy in the mouth in DINO-VQ vs. the blue toy in GT). (ii) PCA of the reconstruction latents (bottom) shows semantic preservation: distillation is affected by RGB information (see VFM-Tok, where the brown ear and face have similar semantic latents), and direct VQ becomes noisy. (iii) Our method resolves these issues, restoring texture via the dual-branch design (see characters) and preserving critical semantic information (distinguishing ears, face, and leg) with reweighted VQ, achieving superior detail preservation and a semantically structured latent space.
  • Figure 2: PCA visualizations across 12 layers of DINOv3. As depth increases, the feature distribution becomes more structured and semantically clustered, while fine-grained image details diminish. This suggests that deeper DINO layers encode increasingly abstract and semantically disentangled representations.
  • Figure 3: DINO-Tok framework: DINO-Tok(AE) and DINO-Tok(VQ). In the AE branch, a frozen DINO encoder provides a dual-branch representation: a shallow feature map $\mathbf{F}_1$ capturing fine texture and color information is projected to 64 dimensions and concatenated with the last-layer feature $\mathbf{F}_L$, enabling reconstruction that preserves low-level fidelity. In contrast, the VQ branch applies a Global PCA Reweighting $w$ to the DINO feature $\mathbf{F}_L$ to reweight channels by their global variance, guiding the codebook lookup toward critical semantic dimensions. To balance semantics and visual detail, the VQ pathway adopts a two-codebook design: a semantic codebook focuses on the high-variance channels emphasized by the PCA weights, while a texture codebook refines fine-grained appearance cues. This design ensures that essential high-level semantics are retained in quantization while maintaining reconstruction quality.
  • Figure 4: Visual comparison of DINO reconstructions. (ii) Directly applying a frozen DINO as the encoder produces an apparent color shift and lacks fine details. (iii) Our DINO-Tok-AE restores texture via dual branches and keeps colors faithful.
  • Figure 5: Visual comparison of VQ reconstructions. (ii) Vanilla VQ on DINO features suffers from several issues. Semantic replacement: local semantics and textures are confused, replacing the mushroom entirely. Semantic overlap: the nearby stump is mistakenly recognized as part of the tree trunk behind it. (iii) VQ with reweighting resolves these issues, preserving key semantic information. (iv) The dual-branch design combined with reweighting achieves faithful reconstruction.
  • ...and 16 more figures
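
The Figure 3 caption describes two mechanisms: concatenating a 64-dimensional projection of a shallow feature map with the last-layer feature, and reweighting the VQ codebook lookup by global variance. The NumPy sketch below illustrates both under stated assumptions and is not the authors' implementation. The function names (`dual_branch_latent`, `fit_global_pca_weights`, `reweighted_lookup`) are hypothetical, the projection matrix stands in for a learned layer, and using the explained-variance ratio as the weight $w$ with a weighted squared distance in PCA coordinates is one plausible reading of "global PCA reweighting"; the paper's exact formulation may differ.

```python
import numpy as np


def dual_branch_latent(f_shallow, f_deep, proj):
    """Concatenate a projected shallow feature map with the last-layer feature.

    f_shallow: (T, C) shallow tokens (F_1); f_deep: (T, C) last-layer tokens (F_L).
    proj: (C, 64) matrix standing in for a learned projection (assumption).
    """
    return np.concatenate([f_shallow @ proj, f_deep], axis=-1)


def fit_global_pca_weights(features, eps=1e-8):
    """Fit PCA once over a global pool of features with shape (N, C).

    Returns the mean (C,), principal directions as columns (C, C), and
    per-component weights (C,). Using the explained-variance ratio as the
    weight is an assumption made for this sketch.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    cov = centered.T @ centered / max(len(features) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder to descending variance
    eigvals = np.clip(eigvals[order], 0.0, None)
    weights = eigvals / (eigvals.sum() + eps)
    return mean, eigvecs[:, order], weights


def reweighted_lookup(tokens, codebook, mean, basis, weights):
    """Return, for each token, the index of the nearest code under a
    variance-weighted squared distance measured in PCA coordinates."""
    z = (tokens - mean) @ basis                 # (T, C) token coordinates
    e = (codebook - mean) @ basis               # (K, C) code coordinates
    diff = z[:, None, :] - e[None, :, :]        # (T, K, C) pairwise differences
    return (weights * diff ** 2).sum(axis=-1).argmin(axis=1)
```

With weights near zero on low-variance components, a code that matches a token along the dominant directions is preferred even when it differs more along minor ones, whereas an unweighted L2 lookup can pick a different code.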