LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection

Ana Vasilcoiu, Ivona Najdenkoska, Zeno Geradts, Marcel Worring

TL;DR

LATTE introduces latent trajectory embedding to detect diffusion-generated images by modeling how latent representations evolve across multiple denoising steps. It extracts one-step-denoised latent states at selected timesteps, fuses them with visual features via transformer decoders, and aggregates them into a discriminative representation that is passed to a lightweight classifier. Across GenImage, Chameleon, and Diffusion Forensics, LATTE achieves state-of-the-art performance, with strong cross-generator and cross-domain robustness, including large gains on challenging subsets. The approach highlights latent trajectory modeling as a promising direction for forensic detection of synthetic media, with practical implications for digital trust and media verification.

Abstract

The rapid advancement of diffusion-based image generators has made it increasingly difficult to distinguish generated from real images. This erodes trust in digital media, making it critical to develop generated-image detectors that remain reliable across different generators. While recent approaches leverage diffusion denoising cues, they typically rely on single-step reconstruction errors and overlook the sequential nature of the denoising process. In this work, we propose LATTE (LATent Trajectory Embedding), a novel approach that models the evolution of latent embeddings across multiple denoising steps. Instead of treating each denoising step in isolation, LATTE captures the trajectory of these representations, revealing subtle and discriminative patterns that distinguish real from generated images. Experiments on several benchmarks, such as GenImage, Chameleon, and Diffusion Forensics, show that LATTE achieves superior performance, especially in challenging cross-generator and cross-dataset scenarios, highlighting the potential of latent trajectory modeling. The code is available at https://github.com/AnaMVasilcoiu/LATTE-Diffusion-Detector.
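
The idea of collecting one-step-denoised latent states at several timesteps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear noise schedule, the function names (`make_alpha_bar`, `latent_trajectory`), the chosen timesteps, and the placeholder noise predictor `toy_eps_model` are all assumptions made for illustration. A real pipeline would use a pretrained latent diffusion model in place of the placeholder.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Assumed linear beta schedule; returns the cumulative product of (1 - beta_t).
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def latent_trajectory(z0, timesteps, eps_model, alpha_bar, rng):
    # For each selected timestep: noise the clean latent, predict the noise,
    # and recover a one-step estimate of the clean latent.
    states = []
    for t in timesteps:
        a = alpha_bar[t]
        noise = rng.standard_normal(z0.shape)
        z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * noise          # forward diffusion
        eps_hat = eps_model(z_t, t)                                # predicted noise
        z0_hat = (z_t - np.sqrt(1.0 - a) * eps_hat) / np.sqrt(a)   # one-step denoising
        states.append(z0_hat)
    return np.stack(states)  # shape: (len(timesteps), *z0.shape)

def toy_eps_model(z_t, t):
    # Hypothetical stand-in for a pretrained noise predictor.
    return np.zeros_like(z_t)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))       # assumed latent shape (channels, height, width)
alpha_bar = make_alpha_bar()
timesteps = [100, 300, 500, 700, 900]     # assumed selection of timesteps
traj = latent_trajectory(z0, timesteps, toy_eps_model, alpha_bar, rng)
print(traj.shape)  # (5, 4, 8, 8)
```

The stacked array is one possible form of the "trajectory": a sequence with one denoised latent per selected timestep, which a downstream model can then treat as a sequence rather than as isolated reconstruction errors.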

Paper Structure

This paper contains 27 sections, 7 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Extraction of LATTE representation. We construct the LATTE sequence by performing a single-step reconstruction for a selection of timesteps throughout the whole trajectory.
  • Figure 2: Overview of our proposed architecture using LATTE. It encompasses two stages: (1) Latent-Visual Fusion, where the LATTE is fused with visual semantics through stacks of $L$ cross-attention layers, and (2) Latent-Visual Classifier for average aggregation and output prediction.
  • Figure 3: Comparison of LATTE to baselines, by training and testing across all 8 generators of GenImage. Each plot corresponds to one detector: DIRE (left; baseline), LaRE (center; baseline), and LATTE (right; proposed). Each plot shows the accuracy (%) when training on the subset listed on the vertical axis and testing on the subset listed along the horizontal axis.
  • Figure 4: Visualizations of t-SNE embeddings for real and fake images across five generators from GenImage. The first row presents embeddings before using LATTE (extracted using ConvNeXt), while the second row shows embeddings derived from LATTE. The much clearer separation in the second row illustrates LATTE's discriminative power.
  • Figure 5: Accuracy (%) of LATTE vs. LaRE on perturbed images. We evaluate and compare the robustness of both methods under four common transformations: JPEG compression, center crop & resize, Gaussian blur, and noise. LATTE consistently outperforms LaRE across all perturbations.
  • ...and 5 more figures
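
The two stages named in the Figure 2 caption (cross-attention fusion followed by average aggregation and prediction) can be sketched roughly as follows. This is an illustrative NumPy sketch under several assumptions, not the paper's architecture: it uses single-head attention with a residual connection, treats the trajectory tokens as queries and the visual tokens as keys and values, assumes both have already been projected to a shared width `d`, and uses random weights. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head scaled dot-product attention from queries to context.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def fuse_and_classify(traj_tokens, visual_tokens, layers, w_out, b_out):
    # Stage 1: a stack of cross-attention layers (assumed residual form).
    h = traj_tokens
    for w_q, w_k, w_v in layers:
        h = h + cross_attention(h, visual_tokens, w_q, w_k, w_v)
    # Stage 2: average over timesteps, then a linear layer and sigmoid.
    pooled = h.mean(axis=0)
    logit = pooled @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
d = 16                                            # assumed shared feature width
traj_tokens = rng.standard_normal((5, d))         # one token per selected timestep
visual_tokens = rng.standard_normal((49, d))      # e.g. a 7x7 grid of visual features
layers = [tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(3))
          for _ in range(2)]                      # two layers of (w_q, w_k, w_v)
w_out = rng.standard_normal(d) * 0.1
b_out = 0.0

p_fake = fuse_and_classify(traj_tokens, visual_tokens, layers, w_out, b_out)
print(float(p_fake))  # probability-like score in (0, 1)
```

Averaging over the timestep axis keeps the output size independent of how many timesteps are selected, which is one plausible reason to prefer average aggregation in a design like this.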