Unsupervised Layer-wise Score Aggregation for Textual OOD Detection

Maxime Darrin, Guillaume Staerman, Eduardo Dadalto Câmara Gomes, Jackie CK Cheung, Pablo Piantanida, Pierre Colombo

TL;DR

This work tackles textual OOD detection by showing that the conventional last-layer score is not reliably optimal across tasks. It introduces a data-driven, unsupervised framework that aggregates layer-wise anomaly scores from all encoder layers, using no-reference and reference-based approaches with per-class detectors, and demonstrates substantial performance gains on the MILTOOD-C benchmark. MILTOOD-C extends evaluation to multilingual settings and to tasks with many classes, revealing strong robustness of the proposed aggregation across languages, tasks, and architectures; the aggregation occasionally surpasses an oracle that knows the best layer. The results highlight that valuable OOD-discriminative information is distributed throughout the encoder, enabling more reliable deployment of NLP systems in real-world scenarios.

Abstract

Out-of-distribution (OOD) detection is a rapidly growing field due to new robustness and security requirements driven by an increased number of AI-based systems. Existing OOD textual detectors often rely on an anomaly score (e.g., Mahalanobis distance) computed on the embedding output of the last layer of the encoder. In this work, we observe that OOD detection performance varies greatly depending on the task and layer output. More importantly, we show that the usual choice (the last layer) is rarely the best one for OOD detection and that far better results could be achieved if the best layer were picked. To leverage this observation, we propose a data-driven, unsupervised method to combine layer-wise anomaly scores. In addition, we extend classical textual OOD benchmarks by including classification tasks with a greater number of classes (up to 77), which reflects more realistic settings. On this augmented benchmark, we show that the proposed post-aggregation methods achieve robust and consistent results while removing manual feature selection altogether. Their performance approaches that of an oracle that selects the best layer.
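
To make the notion of a feature-based anomaly score concrete, the following is a minimal sketch of a Mahalanobis-style score computed on the embeddings of a single layer. It is an illustration under stated assumptions, not the authors' implementation: the function names `fit_class_gaussians` and `mahalanobis_score`, the shared covariance estimate, and the `ridge` regularizer are all hypothetical choices. Applying it once per encoder layer would yield the layer-wise scores that the abstract proposes to combine.

```python
import numpy as np

def fit_class_gaussians(embeddings, labels, ridge=1e-3):
    """Estimate per-class means and a shared precision matrix.

    embeddings: (n_samples, dim) in-distribution embeddings from ONE layer.
    labels:     (n_samples,) integer class labels.
    ridge:      assumed regularizer keeping the covariance invertible.
    """
    classes = np.unique(labels)
    means = {c: embeddings[labels == c].mean(axis=0) for c in classes}
    # Center each sample on its own class mean, then pool the covariance.
    centered = np.vstack([embeddings[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(embeddings)
    cov += ridge * np.eye(cov.shape[0])
    return means, np.linalg.inv(cov)

def mahalanobis_score(x, means, precision):
    """Smallest squared Mahalanobis distance from x to any class mean.

    Higher values mean the sample looks more anomalous at this layer.
    """
    dists = []
    for mu in means.values():
        d = x - mu
        dists.append(float(d @ precision @ d))
    return min(dists)
```

A sample is then flagged as OOD when its score exceeds a threshold chosen on held-out in-distribution data; repeating the fit on every layer's embeddings gives one score per layer.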

Paper Structure (39 sections, 8 equations, 9 figures, 23 tables)

Figures (9)

  • Figure 1: OOD detection performance in terms of AUROC$\uparrow$ for each feature-based OOD score (Mahalanobis distance ($s_M$), maximum cosine similarity ($s_C$), and IRW ($s_{IRW}$)) computed at each layer of the encoder, for different OOD datasets and a model fine-tuned on SST2. The performance of each score at each layer varies significantly with the OOD task, and OOD detection based on the last layer (dark dotted line) rarely yields the best results.
  • Figure 2: Schema of our aggregation procedure. (1) We extract the embeddings at each layer of the encoder for every sample. (2) We compute the per-class scores for a reference set and the new sample to be evaluated for each layer embedding. (3) We aggregate the scores over every layer to get an aggregated per-class score before taking the min score over the classes. (4) Finally, we apply the threshold on this minimum.
  • Figure 3: Average performance of OOD detectors in terms of AUROC$\uparrow$ for tasks involving different numbers of classes.
  • Figure 4: Stability and robustness comparison of the best-performing aggregation methods and the underlying OOD scores, using $s_C$ as the underlying OOD score. Common baselines and SOTA methods show significant performance deviations across languages, whereas score aggregation methods yield more consistent and better performance.
  • Figure 5: Average performance difference in terms of AUROC between aggregation methods and the oracle (best possible layer).
  • ...and 4 more figures
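
The four steps in the Figure 2 caption can be sketched roughly as follows. This is a hedged illustration, not the paper's method: the function `aggregate_and_detect` is hypothetical, the aggregator used here (standardizing each layer's scores against the reference set and averaging over layers) is a simple stand-in for the aggregation methods the paper studies, and the quantile-based threshold is an assumption. Scores are assumed to be distance-like, so that higher means more anomalous.

```python
import numpy as np

def aggregate_and_detect(ref_scores, new_scores, quantile=0.95):
    """Aggregate per-layer, per-class scores and apply a threshold.

    ref_scores: (n_ref, n_layers, n_classes) scores of reference samples.
    new_scores: (n_layers, n_classes) scores of the sample under test.
    quantile:   assumed quantile of reference scores used as threshold.
    Returns (aggregated score, threshold, is_ood).
    """
    # Standardize each (layer, class) cell with reference statistics so
    # that layers with different score scales become comparable.
    mu = ref_scores.mean(axis=0)
    sigma = ref_scores.std(axis=0) + 1e-8
    ref_z = (ref_scores - mu) / sigma
    new_z = (new_scores - mu) / sigma

    # Step 3: aggregate over layers (stand-in: plain mean), which gives
    # one score per class, then take the minimum over classes.
    ref_agg = ref_z.mean(axis=1).min(axis=1)
    new_agg = float(new_z.mean(axis=0).min())

    # Step 4: threshold the minimum, here at a reference-set quantile.
    threshold = float(np.quantile(ref_agg, quantile))
    return new_agg, threshold, bool(new_agg > threshold)
```

Steps 1 and 2 (extracting the embeddings and computing the per-class scores at each layer) are assumed to have produced `ref_scores` and `new_scores` beforehand. Swapping the mean for another aggregator only changes the two lines under step 3.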

Theorems & Definitions (4)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4