V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal

Abstract

Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which of them are truly essential. We therefore present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
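
To make the fourth ingredient concrete, the following is a minimal sketch of RMS-based rescaling. It assumes that calibration means scaling the semantic features so that their root-mean-square magnitude matches a chosen target (here, the RMS of the pixel stream). The abstract does not state the exact rule, so the helper names `rms` and `rescale_to_rms`, the `eps` term, and the choice of target are illustrative assumptions rather than the authors' implementation.

```python
import math


def rms(values):
    """Root-mean-square magnitude of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rescale_to_rms(features, target_rms, eps=1e-8):
    """Scale `features` so that their RMS is (almost exactly) `target_rms`.

    Hypothetical helper: the rescaling rule used in the paper is not
    given in this summary, so this is only one plausible reading.
    """
    scale = target_rms / (rms(features) + eps)  # eps guards against division by zero
    return [v * scale for v in features]


# Toy values: the semantic stream has a much larger magnitude than the pixel stream.
pixels = [0.5, -0.5, 0.25, -0.25]
semantic = [12.0, -8.0, 4.0, -16.0]

# After rescaling, both streams have matching RMS magnitude.
calibrated = rescale_to_rms(semantic, target_rms=rms(pixels))
```

Matching magnitudes in this way keeps one stream from dominating shared operations such as joint attention purely because of its scale; whether the target should be the pixel RMS or a fixed constant is a design choice this sketch leaves open.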

Paper Structure (15 sections, 17 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: An overview of V-Co and its recipe. Starting from a pixel diffusion model, a pretrained DINOv2 encoder, and training images, we identify four key ingredients for effective visual co-denoising: a fully dual-stream architecture, semantic-to-pixel masking for classifier-free guidance, a perceptual-drifting hybrid loss for stronger semantic supervision, and RMS-based feature rescaling for cross-stream calibration. Together, they form a simple and effective recipe for visual co-denoising.
  • Figure 2: Single-stream and dual-stream architectures for visual co-denoising. In the single-stream design (left), noised pixels and DINOv2 features are fused after lightweight stream-specific preprocessing and then processed by shared JiT blocks. We study direct addition, channel concatenation, and token concatenation (see \ref{subsec:model_architecture}). In the dual-stream design (right), the two streams use separate normalization, MLP, and attention projections, while interacting through joint self-attention. A semantic-to-pixel attention mask is used to define the unconditional prediction for CFG (see \ref{subsec:model_architecture_unconditional_generation}). Both designs use separate output heads for pixel and DINOv2 prediction.
  • Figure 3: Comparison of two attention-masking strategies. Yellow tokens indicate the corresponding query and attended key/value tokens, while white tokens indicate positions whose attention scores are masked out.
  • Figure 4: Influence of the DINO diffusion loss coefficient $\lambda_d$. See \ref{subsec:auxiliary_loss} for details.
  • Figure 5: Comparison of guided FID (i.e., FID computed from samples generated with CFG).
  • ...and 3 more figures
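
The semantic-to-pixel masking mentioned in the captions for Figures 1 and 2 can be sketched as a boolean attention mask. The sketch assumes a token order of pixel tokens followed by semantic tokens, and assumes that the mask stops pixel queries from attending to semantic keys, so that the pixel prediction in the unconditional pass cannot use semantic information. The captions do not spell out these details, so the function names and the masking direction are hypothetical. The `cfg_combine` helper applies the standard classifier-free guidance extrapolation, which may differ from the paper's exact formulation.

```python
def semantic_to_pixel_mask(num_pixel, num_semantic):
    """Return allowed[q][k]; True means query token q may attend to key token k.

    Assumed token order: [pixel tokens..., semantic tokens...].
    """
    total = num_pixel + num_semantic
    allowed = [[True] * total for _ in range(total)]
    for q in range(num_pixel):              # rows: pixel queries
        for k in range(num_pixel, total):   # columns: semantic keys
            allowed[q][k] = False           # block information flow from semantic to pixel
    return allowed


def cfg_combine(cond, uncond, scale):
    """Standard CFG extrapolation: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Two pixel tokens followed by two semantic tokens.
mask = semantic_to_pixel_mask(num_pixel=2, num_semantic=2)

# Toy predictions from a conditional pass (no mask) and an unconditional pass (masked).
guided = cfg_combine(cond=[1.0, 0.5], uncond=[0.0, 0.5], scale=2.0)
```

In this reading the unconditional branch is defined by the attention structure itself rather than by dropping an input, which is one way to interpret the abstract's "structurally defined unconditional prediction".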