Adversarially Robust Out-of-Distribution Detection Using Lyapunov-Stabilized Embeddings

Hossein Mirzaei, Mackenzie W. Mathis

TL;DR

This work tackles adversarially robust OOD detection by proposing AROS, which embeds ID and OOD data into a Lyapunov-stabilized Neural ODE framework to ensure perturbations decay toward distinct stable equilibria. AROS avoids collecting real OOD samples during training by crafting fake OOD embeddings from low-likelihood regions in the ID embedding space and leverages an orthogonal binary layer to maximize equilibrium separation. The approach yields substantial robustness gains against strong attacks (e.g., PGD and AutoAttack) across benchmarks such as CIFAR-10/100 and ImageNet, while maintaining competitive clean performance. The results demonstrate the practical impact of stability theory in neural representations for open-world detection and motivate further exploration with pretrained models and transfer learning to narrow any remaining clean-performance gap.
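The idea that perturbations decay toward a stable equilibrium can be illustrated with a toy dynamical system. The sketch below is not the AROS model: it assumes a hand-written linear system $\dot{z} = -r \odot (z - z^*)$ with positive rates $r$, so that $z^*$ is asymptotically stable, and integrates it with forward Euler. The names `simulate`, `z_id_star`, `rates`, and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def simulate(z0, z_star, rates, dt=0.01, steps=1000):
    """Forward-Euler integration of dz/dt = -rates * (z - z_star), per coordinate."""
    z = list(z0)
    for _ in range(steps):
        z = [zi + dt * (-r) * (zi - si) for zi, si, r in zip(z, z_star, rates)]
    return z

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical equilibrium for ID embeddings and a perturbed starting point.
z_id_star = [1.0, 0.0]
perturbed = [1.3, -0.4]

final = simulate(perturbed, z_id_star, rates=[2.0, 1.0])
initial_gap = distance(perturbed, z_id_star)
final_gap = distance(final, z_id_star)
print(f"gap before: {initial_gap:.3f}, gap after: {final_gap:.2e}")
```

Because every rate is positive, the gap to the equilibrium shrinks at each step. In a learned Neural ODE the dynamics are nonlinear, and stability has to be encouraged through training rather than built in by construction as it is here.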

Abstract

Despite significant advancements in out-of-distribution (OOD) detection, existing methods still struggle to maintain robustness against adversarial attacks, compromising their reliability in critical real-world applications. Previous studies have attempted to address this challenge by exposing detectors to auxiliary OOD datasets alongside adversarial training. However, the increased data complexity inherent in adversarial training, and the myriad ways that OOD samples can arise during testing, often prevent these approaches from establishing robust decision boundaries. To address these limitations, we propose AROS, a novel approach leveraging neural ordinary differential equations (NODEs) with the Lyapunov stability theorem to obtain robust embeddings for OOD detection. By incorporating a tailored loss function, we apply Lyapunov stability theory to ensure that both in-distribution (ID) and OOD data converge to stable equilibrium points within the dynamical system. This approach encourages any perturbed input to return to its stable equilibrium, thereby enhancing the model's robustness against adversarial perturbations. To avoid using additional data, we generate fake OOD embeddings by sampling from low-likelihood regions of the ID data feature space, approximating the boundaries where OOD data are likely to reside. To further enhance robustness, we propose the use of an orthogonal binary layer following the stable feature space, which maximizes the separation between the equilibrium points of ID and OOD samples. We validate our method through extensive experiments across several benchmarks, demonstrating superior performance, particularly under adversarial attacks. Notably, our approach improves robust detection performance from 37.8% to 80.1% on CIFAR-10 vs. CIFAR-100 and from 29.0% to 67.0% on CIFAR-100 vs. CIFAR-10.
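Sampling fake OOD embeddings from low-likelihood regions of the ID feature space can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' procedure: a single diagonal Gaussian stands in for the density model, $\beta$ is interpreted as a quantile of the ID log-density, and candidates are drawn by rejection sampling from a widened Gaussian. The function names and the `widen` parameter are hypothetical.

```python
import math
import random

def fit_diag_gaussian(points):
    """Fit a per-dimension mean and variance to a list of embedding vectors."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    var = [sum((p[j] - mean[j]) ** 2 for p in points) / n + 1e-6 for j in range(d)]
    return mean, var

def log_density(x, mean, var):
    """Log-density of x under the fitted diagonal Gaussian."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - 0.5 * (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def craft_fake_ood(points, beta, n_fake, rng, widen=3.0):
    """Keep candidates whose log-density is below the beta-quantile of the ID data."""
    mean, var = fit_diag_gaussian(points)
    scores = sorted(log_density(p, mean, var) for p in points)
    threshold = scores[int(beta * (len(scores) - 1))]  # assumed role of beta
    fake = []
    while len(fake) < n_fake:
        # Proposal: a widened Gaussian, so candidates reach into the tails.
        cand = [rng.gauss(m, widen * math.sqrt(v)) for m, v in zip(mean, var)]
        if log_density(cand, mean, var) < threshold:
            fake.append(cand)
    return fake, threshold

rng = random.Random(0)
# Hypothetical 2-D "ID embeddings" for one class.
id_embeddings = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(500)]
fake_ood, threshold = craft_fake_ood(id_embeddings, beta=0.01, n_fake=100, rng=rng)
print(len(fake_ood), "fake OOD embeddings below log-density", round(threshold, 3))
```

Under this interpretation, a smaller $\beta$ lowers the threshold and pushes the crafted embeddings further from the bulk of the ID data. A mixture model fitted per class would replace `fit_diag_gaussian` and `log_density` in a more faithful version.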

Paper Structure

This paper contains 34 sections, 37 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: OOD detection performance for various models under different perturbation magnitudes. The perturbations are generated using a $\text{PGD}^{1000}$ ($\ell_\infty$) attack targeting both test ID and OOD samples. (A) ImageNet is used as the ID dataset, while the Texture dataset is used as the OOD dataset at test time. (B) CIFAR-10 is used as the ID dataset, with CIFAR-100 as the OOD dataset. (C) CIFAR-100 is used as the ID dataset, with CIFAR-10 as the OOD dataset. A perfect detector achieves an AUROC of 100%, a random detector scores 50%, and a fully compromised detector under attack scores 0%. Notably, no other model achieves detection performance above random (i.e., greater than 50% AUROC) at $\epsilon=\frac{8}{255}$.
  • Figure 2: An illustration of AROS. (A) To obtain robust initial features for OOD detection, we perform adversarial training on a classifier using only ID samples. (B) We estimate the ID distribution within the embedding space and generate fake OOD embeddings as a proxy for real OOD data. This enables the creation of two balanced classes of samples: ID and fake OOD. (C) The model incorporates a NODE layer $h_{\phi}$ and an Orthogonal Binary Layer $B_{\eta}$. Using these two classes, we train the pipeline with the loss function $\mathcal{L}_{\text{SL}}$ to stabilize the system dynamics. (D) During inference, an input passes through the feature extractor $f_{\theta}$, NODE $h_{\phi}$, and Orthogonal Binary Layer $B_{\eta}$, and the resulting likelihood from $B_{\eta}$ serves as the OOD score. The complete algorithmic workflow of AROS can be found in the pseudocode appendix.
  • Figure 3: Visualization of clean and perturbed images from the CIFAR-10 dataset to illustrate the impact of perturbations on semantic content. The first row depicts clean images, while the second and third rows show images perturbed with $L_\infty$ norm of $\frac{8}{255}$ and $L_2$ norm of $\frac{128}{255}$, respectively. Despite the added perturbations, the semantic content of the images remains unchanged, demonstrating that it is fair to expect models to be robust under these perturbations.
  • Figure 4: t-SNE visualization of CIFAR-10 embeddings and the corresponding crafted OOD embeddings for each class. Orange points represent the ID embeddings for each class, while purple points represent the synthetic OOD embeddings crafted using a GMM. The visualization highlights the separability between ID and OOD embeddings. The crafted embeddings are positioned near the boundaries of the ID concepts, emphasizing that they are near-OOD samples and that they cover the OOD space. The $\beta$ hyperparameter used in this experiment is set to 0.001.
  • Figure 5: Unified t-SNE visualization of embeddings for all CIFAR-10 classes and their corresponding crafted OOD embeddings. Each color represents a specific CIFAR-10 class, while the purple points represent the synthetic OOD embeddings crafted using a GMM. The figure demonstrates the clustering of ID embeddings for each class and the distinct distribution of crafted OOD embeddings. The $\beta$ hyperparameter used in this experiment is set to 0.01. Both $\beta = 0.01$ and $\beta = 0.001$ yield effective fake OOD embeddings that lie outside the ID set.
  • ...and 4 more figures
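
The AUROC reference points quoted in the Figure 1 caption (100% for a perfect detector, 50% for an uninformative one, 0% for a fully compromised one) can be reproduced with a small pairwise computation. The sketch below assumes that a higher score means "more likely OOD"; the function name `auroc` and the toy scores are hypothetical.

```python
def auroc(id_scores, ood_scores):
    """Percentage of (OOD, ID) pairs where the OOD sample scores higher; ties count half."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return 100.0 * wins / (len(id_scores) * len(ood_scores))

low, high = [0.1, 0.2, 0.3], [0.7, 0.8, 0.9]

perfect = auroc(id_scores=low, ood_scores=high)         # every OOD score exceeds every ID score
compromised = auroc(id_scores=high, ood_scores=low)     # an attack has flipped the ordering
uninformative = auroc(id_scores=[0.5] * 3, ood_scores=[0.5] * 3)  # scores carry no signal

print(perfect, compromised, uninformative)  # 100.0 0.0 50.0
```

An attack that perturbs both ID and OOD test samples, as in Figure 1, tries to push ID scores up and OOD scores down, which is why a detector can fall below the 50% level of random guessing.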