HALO: Robust Out-of-Distribution Detection via Joint Optimisation

Hugo Lyons Keenan, Sarah Erfani, Christopher Leckie

TL;DR

HALO addresses robust out-of-distribution (OOD) detection by jointly optimising a classifier and an entropy-based detector under adversarial perturbations. Building on TRADES, HALO introduces a helper loss and a dual-attack training objective that covers both in-distribution (ID) and outlier exposure (OE) data, achieving state-of-the-art AUROC across multiple OOD benchmarks and attack settings. The approach demonstrates strong clean performance, resilience to transferred and black-box attacks, and compatibility with common post-processing methods, while also highlighting scalability challenges on larger datasets. Together, these contributions advance trustworthy OOD detection in safety-critical deployments, with scaling to larger datasets and broader domains left as open directions.

Abstract

Effective out-of-distribution (OOD) detection is crucial for the safe deployment of machine learning models in real-world scenarios. However, recent work has shown that OOD detection methods are vulnerable to adversarial attacks, potentially leading to critical failures in high-stakes applications. This discovery has motivated work on robust OOD detection methods that are capable of maintaining performance under various attack settings. Prior approaches have made progress on this problem but face a number of limitations: often only exhibiting robustness to attacks on OOD data or failing to maintain strong clean performance. In this work, we adapt an existing robust classification framework, TRADES, extending it to the problem of robust OOD detection and discovering a novel objective function. Recognising the critical importance of a strong clean/robust trade-off for OOD detection, we introduce an additional loss term which boosts classification and detection performance. Our approach, called HALO (Helper-based AdversariaL OOD detection), surpasses existing methods and achieves state-of-the-art performance across a number of datasets and attack settings. Extensive experiments demonstrate an average AUROC improvement of 3.15 in clean settings and 7.07 under adversarial attacks when compared to the next best method. Furthermore, HALO exhibits resistance to transferred attacks, offers tuneable performance through hyperparameter selection, and is compatible with existing OOD detection frameworks out-of-the-box, leaving open the possibility of future performance gains. Code is available at: https://github.com/hugo0076/HALO

Paper Structure

This paper contains 52 sections, 21 equations, 11 figures, 16 tables, 3 algorithms.

Figures (11)

  • Figure 1: A visualisation of two kinds of detection attack where a CIFAR-10 image (a horse) is ID and an MNIST image (a handwritten four) is OOD. $f$ is the classifier and $g$ the OOD detector. Subfigures left to right: (a) A clean ID sample is passed to the classifier and assigned a label; (b) A clean OOD sample is detected as OOD and rejected; (c) An ID$\rightarrow$OOD attack where the attacked ID sample is detected as OOD and erroneously rejected; and (d) An OOD$\rightarrow$ID attack that causes an OOD sample to evade detection and erroneously be assigned an ID label.
  • Figure 2: Model detection performance (AUROC) in both standard (solid lines) and attacked (dashed lines) settings for different datasets. We compare ALOE [chen2022robust], OSAD [shao2022open], TRADES [zhang2019theoretically], ATD [azizmalayeri2022your] and our method, HALO. Our method achieves the strongest clean and robust performance over a range of different datasets.
  • Figure 3: Two-dimensional toy model showing the effect of different adversarial training methods. Solid lines represent the training data distribution; dashed lines represent allowable perturbations. Training methods left to right: (a) Standard training with OE, (b) Adversarial training on ID data, (c) Adversarial training on OE data, (d) Adversarial training on both ID and OE data. Figures top to bottom: entropy distribution of possible inputs, classification decision boundary, detection decision boundary. Only model (d), which trains on both types of attack, develops robust classification and detection decision boundaries.
  • Figure 4: Hyperparameter sensitivity analysis on CIFAR-10. By default we use $\eta=2.0$, $\gamma=0.5$ and $\beta = 3.0$ and vary only the hyperparameter being examined. All results are averages over 3 independent runs. Top row: averaged AUROC across 6 datasets for different attack settings; bottom row: clean accuracy (left axis) and robust accuracy (right axis). Columns: varying $\eta$ (left), $\gamma$ (middle) and $\beta$ (right).
  • Figure 5: AUROC scores over various values of $\beta_1$ and $\beta_2$, under both ID$\rightarrow$OOD and OOD$\rightarrow$ID attacks. We report the average over 3 runs, with the standard deviation in brackets. Different relative values of each hyperparameter lead to different robustness profiles.
  • ...and 6 more figures
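
The roles of the classifier $f$ and the detector $g$ in Figure 1, and the AUROC metric reported in Figures 2, 4 and 5, can be illustrated with a minimal sketch. The source only states that the detector is entropy-based, so everything below is an assumption made for illustration rather than the authors' implementation: the function names (`entropy_score`, `classify_or_reject`, `auroc`), the fixed rejection threshold, and the convention that OOD is the positive class when computing AUROC.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_score(logits):
    """Shannon entropy of the softmax output; higher means 'more OOD-like'."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def classify_or_reject(logits, threshold):
    """Hypothetical g-then-f pipeline: reject (None) if the entropy exceeds
    the threshold, otherwise return the arg-max class index."""
    if entropy_score(logits) > threshold:
        return None
    return max(range(len(logits)), key=lambda k: logits[k])


def auroc(id_scores, ood_scores):
    """AUROC with OOD as the positive class: the fraction of (OOD, ID) pairs
    in which the OOD sample scores higher, counting ties as one half."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0]   # peaked prediction, low entropy
    uncertain = [0.0, 0.0, 0.0]    # uniform prediction, entropy = ln(3)
    print(classify_or_reject(confident, threshold=0.5))  # accepted as class 0
    print(classify_or_reject(uncertain, threshold=0.5))  # rejected (None)
    print(auroc([0.1, 0.2], [0.9, 0.15]))                # 3 of 4 pairs ordered correctly
```

In this picture, an ID$\rightarrow$OOD attack perturbs an ID input so that its entropy rises above the threshold, while an OOD$\rightarrow$ID attack perturbs an OOD input so that its entropy falls below it. AUROC is threshold-free, which is why it is a convenient summary of detection quality across both clean and attacked settings.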