Spatial Autoregressive Modeling of DINOv3 Embeddings for Unsupervised Anomaly Detection

Ertunc Erdil, Nico Schulthess, Guney Tombak, Ender Konukoglu

TL;DR

Experimental results demonstrate that explicitly modeling spatial dependencies achieves competitive anomaly detection performance while substantially reducing inference time and memory requirements.

Abstract

DINO models provide rich patch-level representations that have recently enabled strong performance in unsupervised anomaly detection (UAD). Most existing methods extract patch embeddings from "normal" images and model them independently, ignoring spatial and neighborhood relationships between patches. This implicitly assumes that self-attention and positional encodings already encode sufficient contextual information within each patch embedding. In addition, the normative distribution is often modeled with memory banks or prototype-based representations, which require storing large numbers of features and performing costly comparisons at inference time, leading to substantial memory and computational overhead. In this work, we address both limitations by proposing a simple and efficient framework that explicitly models spatial and contextual dependencies between patch embeddings using a 2D autoregressive (AR) model. Instead of storing embeddings or clustering prototypes, our approach learns a compact parametric model of the normative distribution via an AR convolutional neural network (CNN). At test time, anomaly detection reduces to a single forward pass through the network, enabling fast and memory-efficient inference. We evaluate our method on the BMAD benchmark, which comprises three medical imaging datasets, and compare it against existing work, including recent DINO-based methods. Experimental results demonstrate that explicitly modeling spatial dependencies achieves competitive anomaly detection performance while substantially reducing inference time and memory requirements. Code is available at the project page: https://eerdil.github.io/spatial-ar-dinov3-uad/.
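
For orientation, a 2D AR model over a grid of patch embeddings is typically written as a product of conditionals in raster-scan order. The sketch below is a generic form of that factorization, not necessarily the paper's exact formulation; the symbols $Z$, $z_{i,j}$, $H$, $W$, and $\theta$ are our own notation, assuming an $H \times W$ embedding grid and a network with parameters $\theta$.

```latex
p_\theta(Z) \;=\; \prod_{i=1}^{H} \prod_{j=1}^{W}
    p_\theta\!\left(z_{i,j} \,\middle|\, z_{<(i,j)}\right),
\qquad
z_{<(i,j)} \;=\; \left\{\, z_{k,l} \;:\; k < i \ \text{ or } \ (k = i \ \text{and}\ l < j) \,\right\}
```

Under this form, each embedding is predicted only from embeddings that precede it in the scan, which is what allows all positions to be evaluated in one forward pass of a suitably masked CNN.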

Paper Structure

This paper contains 15 sections, 5 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Illustration of (a) raster-scan autoregressive factorization over the embedding grid, where the blue cell indicates the current prediction target, green cells denote preceding (conditioned) embeddings, and red cells correspond to future spatial locations, which are treated as unobserved when predicting the blue embedding, (b) standard and masked convolutional kernels without dilation, and (c) the corresponding dilated convolutional kernels. Grey cells indicate masked weights that prevent access to future spatial positions.
  • Figure 2: (top) Runtime vs. AUROC and AUPR trade-off across datasets. Methods located in the upper-left region (high detection performance, low runtime) are preferable. (bottom) Comparison of AUROC and AUPR scores over all datasets, and runtime and memory consumption measurements on the RESC dataset. AUROC and AUPR scores are averaged over three random seeds and reported as mean $\pm$ std. Methods using DINO are highlighted in gray.
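
The masked and dilated kernels described in the Figure 1 caption can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names `causal_mask` and `visible_offsets` and the `include_center` flag are hypothetical, and masking the center cell by default is our assumption (the target embedding must not see itself). The sketch only shows which kernel cells stay active under a raster-scan ordering and which grid offsets they reach once dilation is applied.

```python
def causal_mask(kernel_size, include_center=False):
    """Build a kernel_size x kernel_size 0/1 mask for raster-scan order.

    A cell is kept (1) if it lies in a row above the kernel center, or in
    the center row strictly to the left of the center. All other cells are
    "future" positions and are masked (0). The center cell is masked unless
    include_center is True (an assumption made for this illustration).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    center = kernel_size // 2
    mask = []
    for row in range(kernel_size):
        mask_row = []
        for col in range(kernel_size):
            precedes = row < center or (row == center and col < center)
            is_center = row == center and col == center
            keep = precedes or (is_center and include_center)
            mask_row.append(1 if keep else 0)
        mask.append(mask_row)
    return mask


def visible_offsets(kernel_size, dilation=1, include_center=False):
    """List the (dy, dx) grid offsets an unmasked kernel cell reads from.

    Dilation spreads the kernel taps apart without changing which taps are
    masked, so the receptive field grows while causality is preserved.
    """
    center = kernel_size // 2
    mask = causal_mask(kernel_size, include_center)
    return [
        ((row - center) * dilation, (col - center) * dilation)
        for row in range(kernel_size)
        for col in range(kernel_size)
        if mask[row][col]
    ]


if __name__ == "__main__":
    for mask_row in causal_mask(3):
        print(mask_row)
    print(visible_offsets(3, dilation=2))
```

In a CNN framework, a mask like this would typically be multiplied element-wise with the convolution weights before each forward pass. Every offset returned by `visible_offsets` points to a position that comes earlier in the raster scan, regardless of the dilation rate.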