AToken: A Unified Tokenizer for Vision

Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, Yinfei Yang

TL;DR

AToken is a unified visual tokenizer for images, videos, and 3D assets that encodes all three into a shared 4D latent space using 4D rotary position embeddings. It uses a pure transformer trained without adversarial losses (perceptual and Gram matrix losses instead) and a progressive curriculum of up to four stages to jointly optimize reconstruction and semantic understanding across modalities. The model reports state-of-the-art reconstruction quality, strong semantic alignment, and competitive downstream results, including multimodal LLM integration, image generation, text-to-video generation, and image-to-3D synthesis. The authors position it as a step toward scalable cross-modal vision foundation models that handle both generation and understanding in a single framework.

Abstract

We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for single modalities, AToken encodes these diverse visual inputs into a shared 4D latent space, unifying both tasks and modalities in a single framework. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolutions and temporal durations. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. By employing a progressive training curriculum, AToken gradually expands from single images to videos and 3D, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 40.2% MSRVTT retrieval for videos, and 28.28 PSNR with 90.9% classification accuracy for 3D. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving competitive performance across all benchmarks. These results shed light on next-generation multimodal AI systems built upon unified visual tokenization.
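
The abstract names 4D rotary position embeddings but does not spell out their form here. As a rough illustration only, one common way to extend rotary embeddings to several axes is to split the channel dimension into one group per axis and apply an ordinary 1D rotation within each group. The sketch below assumes that scheme, a coordinate order of (t, x, y, z), and the conventional base of 10000; the function names `rope_1d` and `rope_4d` are hypothetical and none of these choices are confirmed by the paper.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles pos * base**(-k/d)."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        angle = pos * base ** (-k / d)
        c, s = math.cos(angle), math.sin(angle)
        # Standard 2D rotation of the pair (vec[k], vec[k + 1]).
        out += [vec[k] * c - vec[k + 1] * s, vec[k] * s + vec[k + 1] * c]
    return out


def rope_4d(vec, coord):
    """Apply 1D rotary embedding per axis to four equal channel groups.

    vec:   flat list of channels, length divisible by 8 (assumed layout).
    coord: (t, x, y, z) position of the token (assumed axis order).
    """
    if len(coord) != 4 or len(vec) % 8 != 0:
        raise ValueError("need 4 coordinates and a channel count divisible by 8")
    chunk = len(vec) // 4
    out = []
    for axis, pos in enumerate(coord):
        out += rope_1d(vec[axis * chunk:(axis + 1) * chunk], pos)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Under this scheme the attention score between two tokens depends only on their coordinate offsets along each axis, so a token at the all-zero coordinate is left unchanged and shifting both tokens by the same amount leaves `dot(rope_4d(q, p), rope_4d(k, p2))` the same. A modality that lacks an axis (for example, a single image has no temporal extent) could simply hold that coordinate fixed, though that too is an assumption of this sketch.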

Paper Structure

This paper contains 44 sections, 7 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Illustration of our method on different visual modalities. Given images, videos, and 3D assets, AToken leverages a shared 4D latent space (left) to produce high-fidelity reconstructions (middle: zoomed regions with red boxes for images, temporal frames for videos, multiple viewpoints for 3D) while preserving strong semantic understanding (right: showing text-aligned representations for zero-shot text retrieval).
  • Figure 2: Overview of our method. All modalities undergo unified space-time patchification and encoding into sparse 4D latents, which support both reconstruction through modality-specific decoders and understanding through attention pooling and text alignment. The architecture jointly optimizes reconstruction and understanding losses, maintaining sparse structured representations throughout for efficient multimodal processing.
  • Figure 3: 3D tokenization pipeline. We extend Trellis-SLAT (Xiang et al., 2024) for multimodal unification through two modifications: directly tokenizing raw RGB patches from multiview renderings (as opposed to using DINOv2 features), and aggregating each voxel's features from its nearest viewpoint (as opposed to averaging across all views). Combined with Gaussian decoding, this approach integrates 3D assets into our unified token space alongside images and videos.
  • Figure 4: Adversarial-free training with Gram loss achieves stable, high-fidelity reconstruction. (a) GAN training fails in our setting: the discriminator overpowers the generator, causing diverging logits and degraded rFID. (b) Decomposing rFID reveals $\approx86.6\%$ of error stems from covariance (texture/style) vs. $\approx13.4\%$ from mean components. (c) Gram loss directly optimizes second-order statistics (i.e., feature covariance) without adversarial training, achieving superior and stable rFID throughout training.
  • Figure 5: Progressive training curriculum of AToken. Our model starts from SigLIP2 image understanding and progressively adds: (1) image reconstruction, (2) video capabilities with temporal modeling, (3) 3D understanding with expanded resolutions, and optionally (4) discrete tokenization via FSQ. Each box shows the new capabilities introduced at that stage, along with supported resolutions, patch sizes, and sampling strategies.
  • ...and 5 more figures
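
The Figure 4 caption describes Gram loss as matching second-order feature statistics without a discriminator. A minimal sketch of that idea follows, assuming features come from some fixed feature extractor and are given as C channels of N flattened spatial activations. The function names, the normalization by N, and the mean-squared comparison are illustrative assumptions, not the paper's implementation. Note that this sketch uses the uncentered Gram matrix (channel inner products), which is related to, but not identical with, a feature covariance.

```python
def gram_matrix(features):
    """Return the C x C matrix of channel inner products, normalized by N.

    features: list of C channels, each a flat list of N spatial activations.
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]


def gram_loss(recon_features, target_features):
    """Mean squared difference between the Gram matrices of two feature maps."""
    g_r = gram_matrix(recon_features)
    g_t = gram_matrix(target_features)
    c = len(g_r)
    total = sum(
        (g_r[i][j] - g_t[i][j]) ** 2 for i in range(c) for j in range(c)
    )
    return total / (c * c)


# Two channels over three spatial positions.
target = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
# Same activations with spatial positions permuted consistently across channels.
shuffled = [[3.0, 1.0, 2.0], [0.0, 0.0, 1.0]]
# Same layout, but every activation doubled.
scaled = [[2.0, 4.0, 6.0], [0.0, 2.0, 0.0]]
```

Here `gram_loss(shuffled, target)` is zero because the Gram matrix sums over spatial positions and so ignores where activations occur, while `gram_loss(scaled, target)` is positive because the channel statistics differ. This is why such a loss is usually paired with a spatially aware term, such as the perceptual loss mentioned in the abstract.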