
Anomalies by Synthesis: Anomaly Detection using Generative Diffusion Models for Off-Road Navigation

Siddharth Ancha, Sunshine Jiang, Travis Manderson, Laura Brandt, Yilun Du, Philip R. Osteen, Nicholas Roy

TL;DR

This work tackles robust anomaly detection for off-road navigation by reframing it as a post-hoc analysis-by-synthesis problem. It uses a diffusion model trained on in-distribution data to edit the input image, removing anomalies while preserving non-OOD content, and then detects anomalous regions by comparing the input and edited images in a semantically rich feature space. A principled guided-diffusion mechanism, derived from the ideal guidance gradient and a tractable approximation of it, enables edit-focused sampling without retraining. The pipeline combines MaskCLIP/FeatUp and SAM to produce pixel-wise anomaly maps, and reports strong gains on the RUGD and RELLIS-3D datasets, along with qualitative underwater results, illustrating practical impact for autonomous off-road systems.

Abstract

To navigate safely and reliably in off-road and unstructured environments, robots must detect anomalies that are out-of-distribution (OOD) with respect to the training data. We present an analysis-by-synthesis approach for pixel-wise anomaly detection without making any assumptions about the nature of OOD data. Given an input image, we use a generative diffusion model to synthesize an edited image that removes anomalies while keeping the remaining image unchanged. Then, we formulate anomaly detection as analyzing which image segments were modified by the diffusion model. We propose a novel inference approach for guided diffusion by analyzing the ideal guidance gradient and deriving a principled approximation that bootstraps the diffusion model to predict guidance gradients. Our editing technique operates purely at test time and can be integrated into existing workflows without the need for retraining or fine-tuning. Finally, we use a combination of vision-language foundation models to compare pixels in a learned feature space and detect semantically meaningful edits, enabling accurate anomaly detection for off-road navigation. Project website: https://siddancha.github.io/anomalies-by-diffusion-synthesis/
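
The analysis step described above (comparing the input and edited images in a learned feature space, then reasoning over image segments) can be illustrated with a minimal sketch. It assumes per-pixel feature maps and an integer segment map are already available as nested lists; the function names, the toy data, and the segment-mean refinement rule are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cos(u, v) for two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / max(norm_u * norm_v, eps)


def raw_anomaly_map(feats_input, feats_edited):
    """Per-pixel cosine distance between two H x W x C feature maps."""
    return [
        [cosine_distance(u, v) for u, v in zip(row_in, row_ed)]
        for row_in, row_ed in zip(feats_input, feats_edited)
    ]


def refine_with_segments(raw_map, segment_ids):
    """Assumed refinement rule: give each pixel the mean score of its segment."""
    totals, counts = {}, {}
    for row_scores, row_ids in zip(raw_map, segment_ids):
        for score, seg in zip(row_scores, row_ids):
            totals[seg] = totals.get(seg, 0.0) + score
            counts[seg] = counts.get(seg, 0) + 1
    return [
        [totals[seg] / counts[seg] for seg in row_ids]
        for row_ids in segment_ids
    ]


# Toy 2 x 2 example with 2-dimensional features (hypothetical data).
# Only the bottom-right pixel differs between the input and the edited image.
feats_input = [[[1.0, 0.0], [1.0, 0.0]],
               [[1.0, 0.0], [0.0, 1.0]]]
feats_edited = [[[1.0, 0.0], [1.0, 0.0]],
                [[1.0, 0.0], [1.0, 0.0]]]
segment_ids = [[0, 0],
               [0, 1]]

raw = raw_anomaly_map(feats_input, feats_edited)
refined = refine_with_segments(raw, segment_ids)
```

In this toy case only the pixel whose features changed receives a high score, and averaging within segments leaves the unchanged segment at zero.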

Paper Structure

This paper contains 19 sections, 6 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Our proposed pipeline for pixel-wise anomaly detection. Left to right: In the synthesis step, a trained diffusion model edits a given input image to remove anomaly segments without modifying other parts of the image. In this case, the model blends the OOD vehicle into dirt in the background. The analysis step extracts anomalies by comparing the pair of images in the CLIP radford2021clip feature space. First, MaskCLIP dong2023maskclip computes low-resolution CLIP features for each image, which are upsampled using FeatUp fu2024featup. In this figure, features are visualized via a t-SNE projection to three dimensions. Cosine distances between pixel features in the two images produce a raw anomaly map that highlights anomaly objects. In contrast, comparing the images directly in RGB space (extreme right) is noisy and unable to isolate OOD segments. Finally, SAM kirillov2023segment processes the input image to generate segments; these are used to refine and clean the anomaly map.
  • Figure 2: Probabilistic graphical model for the conditional forward diffusion process. The target variable we wish to sample is $\mathbf{x}_0$. Directed edges correspond to the standard forward diffusion process. The unnormalized factor $r_\mathrm{sim}(\mathbf{x}_0, \mathbf{x}_0^\mathrm{input})$ conditions $\mathbf{x}_0$ to be similar to the (fixed) input image.
  • Figure 3: Probabilistic graphical model of the conditional forward diffusion process. We wish to sample the random variable $\mathbf{x}_0$ corresponding to the training data distribution $q(\mathbf{x}_0)$. The directed edges between $\mathbf{x}_{t-1}$ and $\mathbf{x}_t$ (for $t = 1, \dots, T$) correspond to the vanilla forward diffusion process. Each directed edge denotes the sampling distribution $q(\mathbf{x}_t \,\vert\, \mathbf{x}_{t-1})$, which successively adds a small amount of Gaussian noise: $q(\mathbf{x}_t \,\vert\, \mathbf{x}_{t-1}) = \mathcal{N}\left( \mathbf{x}_t\,;\,\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I} \right)$ sohl2015deep, ho2020denoising. However, we are interested in sampling from $q(\mathbf{x}_0 \,\vert\, \mathbf{x}_0^\mathrm{input}) \propto q(\mathbf{x}_0)\,r_\mathrm{sim}(\mathbf{x}_0, \mathbf{x}_0^\mathrm{input})$. This objective corresponds to adding an additional undirected factor $r_\mathrm{sim}(\mathbf{x}_0, \mathbf{x}_0^\mathrm{input})$ between $\mathbf{x}_0$ and $\mathbf{x}_0^\mathrm{input}$; $\mathbf{x}_0^\mathrm{input}$ is treated as constant. Our task is to perform inference over this graphical model and sample from $q(\mathbf{x}_0 \,\vert\, \mathbf{x}_0^\mathrm{input})$ using a diffusion model that was trained to perform reverse diffusion in the absence of the $r_\mathrm{sim}$ factor.
  • Figure 4: Examples of video frames, annotations and semantic classes from the full RUGD dataset RUGD.
  • Figure 5: In-distribution and out-of-distribution images from the RUGD dataset. Left: Examples of the in-distribution images on which our RUGD diffusion model was trained. In general, these images contain natural, off-road vegetation (a mixture of forest, meadow, mulch, and paths), without any humans or artificial constructions like buildings or vehicles. Right: Examples of held-out, out-of-distribution RUGD images the robot might encounter. These contain anomaly objects like buildings and vehicles. The diffusion model trained on the images on the left must remove anomalies from the images on the right.
  • ...and 2 more figures
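
The vanilla forward diffusion step quoted in the Figure 3 caption, which samples $\mathbf{x}_t$ from a Gaussian with mean $\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}$ and covariance $\beta_t \mathbf{I}$, can be sketched in a few lines. This is a minimal illustration on a flat list of values; the function names, the linear noise schedule, and the toy input are assumptions made for the example and are not taken from the paper.

```python
import math
import random


def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I), elementwise."""
    scale = math.sqrt(1.0 - beta_t)
    std = math.sqrt(beta_t)
    return [scale * x + std * rng.gauss(0.0, 1.0) for x in x_prev]


def forward_chain(x0, betas, rng):
    """Apply the forward step for t = 1..T and return [x_0, x_1, ..., x_T]."""
    trajectory = [list(x0)]
    for beta_t in betas:
        trajectory.append(forward_step(trajectory[-1], beta_t, rng))
    return trajectory


rng = random.Random(0)  # seeded for reproducibility
x0 = [0.5, -0.25, 1.0]  # hypothetical "image" with three values
betas = [0.01 * (t + 1) for t in range(10)]  # hypothetical noise schedule
trajectory = forward_chain(x0, betas, rng)
```

With `beta_t = 0` the step returns its input unchanged, and larger values of `beta_t` shrink the signal while adding more noise.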