REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Giorgos Petsangourakis, Christos Sgouropoulos, Bill Psomas, Theodoros Giannakopoulos, Giorgos Sfikas, Ioannis Kakogeorgiou

TL;DR

Latent diffusion models struggle to extract high-level semantics within reconstruction-based training. REGLUE unifies three semantic streams—VAE latents, local VFM patch features, and global CLS tokens—inside a single Scalable Interpolant Transformer, aided by a lightweight nonlinear semantic compressor and an external alignment loss. The approach yields substantial gains on ImageNet 256×256, accelerating convergence and improving FID, sFID, and related metrics over prior REG/ReDi baselines, with local semantics and nonlinear compression driving the largest improvements. The findings highlight the importance of spatial, multi-layer VFM information and demonstrate that global tokens provide complementary benefits when fused with local, latent guidance. This framework offers a practical, data-efficient path toward higher-fidelity diffusion generation with modest additional compute.

Abstract

Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256×256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) nonlinear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue .

Paper Structure

This paper contains 55 sections, 12 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: REGLUE fast convergence. Qualitative evolution of SiT-B/2+REGLUE at 50K/100K/200K/400K training steps. All use identical noise, the same sampling schedule/step count, and no classifier-free guidance. REGLUE achieves high fidelity early.
  • Figure 2: Semantic compressor architecture and training. The representations from the last four layers of the vision foundation model (VFM) encoder are concatenated and passed to the compression model, which projects them into a compact 16-channel semantic representation. In our default configuration (corresponding to the middle row of the model-size table, tab:model_size), the compressor maps the dense concatenated VFM features through an input layer Conv2D(3072, 256), a middle ResidualBlock(256, 256), and an output layer Conv2D(256, 16), where 256 is the hidden dimensionality. The semantic de-compressor then reconstructs the compact semantics back to their original dimensionality. The model is trained using an MSE loss between the dense concatenated features and their reconstructed counterparts.
  • Figure 3: Attentive probing accuracy vs. generation quality on ImageNet for different DINOv2 patch-level compression variants. Each point shows the top-1 attentive probing accuracy [psomas2025attention] and the FID of the corresponding SiT model, with bubble area proportional to the semantic feature dimensionality. Our nonlinear semantic compressors (8 and 16 channels) achieve substantially better FID at higher probing accuracy than the PCA-compressed features of ReDi. The vertical dashed line marks the accuracy of the full 768-channel DINOv2 representation.
  • Figure 4: Performance vs. compression channels. Ablation of the number of output channels of the compressor, applied to DINOv2's last-layer representation, using SiT-B/2 trained for 400K steps without the REPA loss.
  • Figure 5: Semantic segmentation performance (mIoU) vs. generation quality for different DINOv2 patch-level compression variants. Each point shows the segmentation mIoU on Cityscapes [Cordts_2016_CVPR], obtained with a DPT [ranftl2021vision] head on frozen features following the implementation of [karypidis2025dinoforesight, yang_depth, yang2024depth], and the ImageNet FID of the corresponding SiT model. Bubble area is proportional to feature dimensionality. Our nonlinear semantic compressors (8 and 16 channels) achieve substantially better FID at higher mIoU than the PCA-compressed features of ReDi. The vertical dashed line indicates the mIoU of the full 768-channel DINOv2 representation.
  • ...and 4 more figures
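
The compressor layout in the Figure 2 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation, and several details are assumptions because the caption does not state them: 1×1 convolution kernels, a ReLU inside the residual block, a de-compressor that simply mirrors the compressor, a 16×16 patch grid, and random weights and features standing in for trained parameters and real DINOv2 outputs. All function names are hypothetical. Only the channel sizes (4 × 768 = 3072 → 256 → 16) and the MSE reconstruction objective come from the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_conv(c_in, c_out):
    # Weights and bias for a 1x1 convolution (kernel size is an assumption).
    w = rng.standard_normal((c_out, c_in)) / np.sqrt(c_in)
    return w, np.zeros(c_out)

def conv1x1(x, params):
    # x: (C_in, H, W) -> (C_out, H, W); mixes channels at each spatial position.
    w, b = params
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

def residual_block(x, p1, p2):
    # Assumed form: conv -> ReLU -> conv, plus a skip connection.
    h = np.maximum(conv1x1(x, p1), 0.0)
    return x + conv1x1(h, p2)

def make_params(c_in, hidden, c_out):
    return {
        "in": init_conv(c_in, hidden),
        "res1": init_conv(hidden, hidden),
        "res2": init_conv(hidden, hidden),
        "out": init_conv(hidden, c_out),
    }

def apply_stack(x, p):
    # Input conv -> residual block -> output conv, as listed in the caption.
    h = conv1x1(x, p["in"])
    h = residual_block(h, p["res1"], p["res2"])
    return conv1x1(h, p["out"])

# Stand-ins for the last four VFM layers: 768 channels each on a 16x16 grid.
layers = [rng.standard_normal((768, 16, 16)) for _ in range(4)]
dense = np.concatenate(layers, axis=0)        # shape (3072, 16, 16)

compressor = make_params(3072, 256, 16)       # Conv2D(3072, 256) ... Conv2D(256, 16)
decompressor = make_params(16, 256, 3072)     # assumed mirror of the compressor

compact = apply_stack(dense, compressor)      # shape (16, 16, 16)
recon = apply_stack(compact, decompressor)    # shape (3072, 16, 16)

# MSE between the dense concatenated features and their reconstruction.
mse = float(np.mean((recon - dense) ** 2))
print(compact.shape, recon.shape)
```

In an actual training loop, `mse` would be minimized with respect to both parameter sets, and only the 16-channel `compact` map would be passed on to the diffusion model. The 192× channel reduction (3072 to 16) keeps the spatial grid intact, which is the "spatially structured" property the abstract emphasizes.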