Language-Assisted Feature Transformation for Anomaly Detection

EungGu Yun, Heonjin Ha, Yeongwoo Nam, Bryan Dongik Lee

TL;DR

This work addresses the challenge of defining a flexible normality boundary for anomaly detection under limited or biased data. It introduces Language-Assisted Feature Transformation (LAFT), a training-free approach that uses the CLIP embedding space to build concept axes from text prompts and projects visual features onto them, so that a chosen attribute can be emphasized via the transformation $v' = T(v)$ or suppressed via its orthogonal variant. By combining LAFT with a $k$-NN anomaly scorer (LAFT AD) and integrating LAFT into WinCLIP for industrial AD (WinCLIP+LAFT), the method achieves strong semantic and industrial anomaly detection performance without additional training data. The approach is robust to prompt quality, improves detection of anomalies aligned with user knowledge, and is practical in settings where domain knowledge is available but labeled anomalies are scarce. Limitations include the heuristic selection of the PCA dimension $d$ and limited improvement in anomaly localization, which suggest automatic dimension selection and finer-grained modeling as directions for future work.

Abstract

This paper introduces LAFT, a novel feature transformation method designed to incorporate user knowledge and preferences into anomaly detection using natural language. Accurately modeling the boundary of normality is crucial for distinguishing abnormal data, but this is often challenging due to limited data or the presence of nuisance attributes. While unsupervised methods that rely solely on data without user guidance are common, they may fail to detect anomalies of specific interest. To address this limitation, we propose Language-Assisted Feature Transformation (LAFT), which leverages the shared image-text embedding space of vision-language models to transform visual features according to user-defined requirements. Combined with anomaly detection methods, LAFT effectively aligns visual features with user preferences, allowing anomalies of interest to be detected. Extensive experiments on both toy and real-world datasets validate the effectiveness of our method.

Paper Structure

This paper contains 36 sections, 7 equations, 3 figures, 12 tables.

Figures (3)

  • Figure 1: High-level motivation of our method: (left) typical image anomaly detection methods treat all test data that differs from the training data as anomalies, while (right) our method, LAFT AD, incorporates user preferences into the anomaly detection.
  • Figure 2: Overview of our method, LAFT, a transformation module, and LAFT AD, which combines LAFT with a $k$NN classifier. Our approach uses CLIP's text and image encoders without any additional training. The key idea is to use text prompts containing concept values to construct a concept subspace for the target attribute. This process involves computing pairwise differences of concept prototypes and extracting robust concept axes via PCA. Once the concept subspaces are created, the shared embedding space can be used to transform image features into a form suitable for anomaly detection.
  • Figure 3: Projection of image features from CLIP's image encoder (left) and transformed image features using LAFT (right). Without guidance, the image features may not align with the intended attributes. After applying LAFT, the features become more aligned with the desired attributes.
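
The pipeline described in the Figure 2 caption (concept prototypes, pairwise differences, PCA, projection, $k$NN scoring) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: random vectors stand in for CLIP text and image embeddings, and the function names, the mode labels "guide" and "ignore", the uncentered SVD used for PCA, and the mean-cosine-distance $k$NN score are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (rows with zero norm are left near zero)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def concept_axes(prototypes, d):
    """Return d orthonormal concept axes from pairwise prototype differences.

    prototypes: (n, D) array, one embedding per concept value.
    Both (i, j) and (j, i) differences are included, so the set of
    differences has zero mean and an uncentered SVD acts as PCA.
    """
    n = len(prototypes)
    diffs = np.stack(
        [prototypes[i] - prototypes[j] for i in range(n) for j in range(n) if i != j]
    )
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:d]  # shape (d, D), rows are orthonormal


def laft_transform(features, axes, mode="guide"):
    """Project features onto the concept subspace or its orthogonal complement."""
    projected = features @ axes.T @ axes
    if mode == "guide":   # keep only the attribute of interest
        return projected
    if mode == "ignore":  # remove a nuisance attribute
        return features - projected
    raise ValueError(f"unknown mode: {mode!r}")


def knn_score(train, test, k=5):
    """Anomaly score: mean cosine distance to the k nearest training features."""
    sims = l2_normalize(test) @ l2_normalize(train).T
    topk = np.sort(sims, axis=1)[:, -k:]
    return 1.0 - topk.mean(axis=1)


# Synthetic stand-ins for embeddings (assumption: real use would call CLIP encoders).
rng = np.random.default_rng(0)
D = 16
prototypes = l2_normalize(rng.normal(size=(4, D)))  # text embeddings of concept prompts
train = l2_normalize(rng.normal(size=(50, D)))      # image features of normal data
test = l2_normalize(rng.normal(size=(10, D)))       # image features to score

axes = concept_axes(prototypes, d=2)
scores = knn_score(
    laft_transform(train, axes, "guide"),
    laft_transform(test, axes, "guide"),
    k=5,
)
```

With $n$ prototypes the differences span at most $n - 1$ directions, so in this sketch $d$ should not exceed $n - 1$. The two modes are complementary: adding the "guide" and "ignore" outputs recovers the original feature.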