CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Yitong Chen, Zuxuan Wu, Xipeng Qiu, Yu-Gang Jiang

TL;DR

CaTok is a 1D causal image tokenizer with a MeanFlow decoder. It learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals.

Abstract

Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that are misaligned with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens sacrifices causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization, REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR, and 0.674 SSIM with fewer training epochs, and that the AR model attains performance comparable to leading approaches.
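
For readers unfamiliar with the MeanFlow objective named above, the average-velocity quantity it is built on can be written as follows. This is a sketch in the notation of the general MeanFlow formulation, not necessarily the notation of this paper: $z_t$ is a point on the flow path, $v$ is the instantaneous velocity, $u$ is the average velocity over $[r,t]$, and the conditioning on selected tokens (written $c_{[r,t]}$ here) is an assumed shorthand for the time-interval token selection described in the abstract.

$$
u(z_t, r, t \mid c_{[r,t]}) = \frac{1}{t-r}\int_{r}^{t} v(z_\tau, \tau)\, d\tau,
\qquad
z_r = z_t - (t-r)\, u(z_t, r, t \mid c_{[r,t]}).
$$

Setting $r=0$ and $t=1$ gives the one-step case, $z_0 = z_1 - u(z_1, 0, 1 \mid c_{[0,1]})$, while splitting $[0,1]$ into several subintervals and applying the update on each gives multi-step sampling.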

Paper Structure (17 sections, 16 equations, 6 figures, 5 tables)

This paper contains 17 sections, 16 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Reconstruction samples. CaTok with a MeanFlow decoder [meanflow] supports fast one-step (col. 2) and high-quality multi-step (col. 3) sampling with 256 tokens. Reconstructions in cols. 3–7 show a fine-to-coarse trend as the number of tokens is reduced from 256 to 16, highlighting the causality of the 1D tokens. Cols. 7–10 present reconstructions from different 16-token segments, demonstrating that CaTok naturally learns diverse visual concepts across token intervals.
  • Figure 2: Comparison among different decoders. a) Naïve flow decoders [flowmo] condition on all 1D tokens from the encoder without dropout, so the 1D tokens lack causality; b) Consistency decoders obtain $k$ by random sampling [flextok, semanticist] or timestep binding [ddt, selftok] and condition on the first $k$ 1D tokens, which biases learning toward early tokens and introduces an imbalance that degrades AR generation; c) Our MeanFlow decoder conditions on the 1D tokens within the time interval $[r,t]$ to model the average velocity field along the subpath, which inherently maintains the causality and balance of the 1D visual tokens and supports one-step sampling during image reconstruction or generation.
  • Figure 3: Architecture of CaTok. CaTok is a diffusion autoencoder with a causal Vision Transformer (ViT) [vit] encoder and a MeanFlow Diffusion Transformer (DiT) [dit] decoder. The encoder uses registers [registers] to extract rich visual information into 1D tokens, which then condition the decoder through time-interval selection. With two flow objectives and two representation alignment objectives, CaTok learns causal 1D representations that support both one-step and multi-step sampling, while naturally capturing diverse visual concepts across different token intervals.
  • Figure 4: Panel a) visualizes the causal mask mechanism in the ViT. After training CaTok, we freeze the encoder to extract 1D tokens. During the AR training stage, these tokens are optimized with a class-token prefix using teacher forcing under a diffusion loss [mar]. At sampling time, given a learned class token as input, the AR model predicts the corresponding visual 1D tokens, which then condition the decoder for generation.
  • Figure 5: Qualitative Results. 256$\times$256 generated images on ImageNet-1K with CaTok-L-32.
  • ...and 1 more figures
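
The three conditioning schemes contrasted in the Figure 2 caption can be sketched as boolean masks over the 1D token sequence. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the linear mapping from a time interval $[r,t]$ to token indices is an assumption made only for the example, since the caption does not specify how times map to tokens.

```python
import numpy as np


def all_tokens_mask(num_tokens: int) -> np.ndarray:
    """(a) Condition on every token; no ordering is imposed on the sequence."""
    return np.ones(num_tokens, dtype=bool)


def prefix_mask(num_tokens: int, k: int) -> np.ndarray:
    """(b) Condition on the first k tokens only."""
    if not 1 <= k <= num_tokens:
        raise ValueError("k must lie in [1, num_tokens]")
    mask = np.zeros(num_tokens, dtype=bool)
    mask[:k] = True
    return mask


def interval_mask(num_tokens: int, r: float, t: float) -> np.ndarray:
    """(c) Condition on the tokens inside the time interval [r, t].

    ASSUMPTION: time in [0, 1] is mapped linearly onto token indices.
    The real mapping used by CaTok may differ.
    """
    if not 0.0 <= r <= t <= 1.0:
        raise ValueError("expected 0 <= r <= t <= 1")
    start = min(int(np.floor(r * num_tokens)), num_tokens - 1)
    stop = max(int(np.ceil(t * num_tokens)), start + 1)  # keep at least one token
    mask = np.zeros(num_tokens, dtype=bool)
    mask[start:stop] = True
    return mask


def prefix_selection_counts(num_tokens: int) -> np.ndarray:
    """Count how often each token is kept when k ranges over 1..num_tokens."""
    counts = np.zeros(num_tokens, dtype=int)
    for k in range(1, num_tokens + 1):
        counts += prefix_mask(num_tokens, k)
    return counts
```

With prefix selection, the first token is kept for every value of $k$ and the last token only once, which is one way to see the bias toward early tokens that the caption describes. Interval selection can instead pick out any contiguous segment, such as the different 16-token segments out of 256 mentioned in the Figure 1 caption.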