Image Outlier Detection Without Training using RANSAC

Chen-Han Tsai, Yu-Shao Peng

TL;DR

This paper tackles image outlier detection when the training data may be contaminated by outliers. It introduces RANSAC-NN, a training-free algorithm with two stages, Inlier Score Prediction (ISP) and Threshold Sampling (TS), that scores outliers directly from the data distribution via sub-sampling and cosine similarity in embedding space. The method performs competitively against trained OD models on natural image benchmarks, is robust to contamination, and improves existing OD methods when used as a data-cleaning step. The paper also gives guidance on the hyperparameters (m,s,τ,t), shows consistent behavior across different feature extractors, and discusses practical implications for detecting mislabeled images. Overall, RANSAC-NN is a practical, training-free option both for robust image OD and as a preprocessing tool that strengthens downstream models.

Abstract

Image outlier detection (OD) is an essential tool to ensure the quality of images used in computer vision tasks. Existing algorithms often involve training a model to represent the inlier distribution, and outliers are determined by some deviation measure. Although existing methods have proved effective when trained on strictly inlier samples, their performance remains questionable when undesired outliers are included during training. As a result of this limitation, it is necessary to carefully examine the data when developing OD models for new domains. In this work, we present a novel image OD algorithm called RANSAC-NN that eliminates the need for data examination and model training altogether. Unlike existing approaches, RANSAC-NN can be applied directly to datasets containing outliers by sampling and comparing subsets of the data. Our algorithm maintains favorable performance compared to existing methods on a range of benchmarks. Furthermore, we show that incorporating RANSAC-NN into the data preparation process enhances the robustness of existing methods.
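
The idea of "sampling and comparing subsets of the data" can be illustrated with a small sketch. This is not the authors' ISP or TS procedure, whose details are not given here. It only assumes that each image has an embedding vector, that random subsets are drawn repeatedly, and that each image is scored by its highest cosine similarity to a subset. The function name `subsample_inlier_scores`, its parameters, the choice of keeping the minimum score across iterations, and the toy 2-D embeddings are all hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def subsample_inlier_scores(embeddings, sample_size, iterations, seed=0):
    """Score each embedding by comparing it against random subsets.

    Assumed aggregation (not taken from the paper): in every iteration an
    item receives its highest cosine similarity to the sampled subset
    (excluding itself), and its final score is the minimum over iterations.
    """
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    rng = random.Random(seed)
    n = len(embeddings)
    scores = [math.inf] * n
    for _ in range(iterations):
        sample = rng.sample(range(n), sample_size)
        for i in range(n):
            # Nearest neighbour of item i inside the sampled subset.
            best = max(
                cosine_similarity(embeddings[i], embeddings[j])
                for j in sample
                if j != i
            )
            scores[i] = min(scores[i], best)
    return scores


# Toy data: five embeddings pointing roughly the same way, one opposite.
toy_embeddings = [
    [1.0, 0.0], [1.0, 0.1], [1.0, -0.1],
    [1.0, 0.05], [1.0, -0.05], [-1.0, 0.0],
]
toy_scores = subsample_inlier_scores(toy_embeddings, sample_size=3, iterations=20)
suspect = toy_scores.index(min(toy_scores))  # index of the lowest inlier score
```

In this toy setting the item with the lowest score is the one pointing away from the rest, since no subset contains a close neighbour for it. Real use would replace the toy vectors with features from an image encoder.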

Paper Structure (36 sections, 5 equations, 9 figures, 6 tables, 2 algorithms)

This paper contains 36 sections, 5 equations, 9 figures, 6 tables, 2 algorithms.

Figures (9)

  • Figure 1: Performance Drop from Contaminated Training. Shown above is the performance of several OD algorithms when trained on a contaminated set. At low contamination levels, the drop in performance is already visible in some algorithms. As the contamination level increases, the performance drop is evident across all algorithms. In contrast to the uncontaminated setting, where the performance was relatively constant, this performance difference highlights the importance of maintaining a quality dataset when training OD models. Since RANSAC-NN does not require training, its performance is not influenced by contaminated training.
  • Figure 2: Outlier Filtering with RANSAC-NN. Shown above is the outlier score distribution predicted by RANSAC-NN on ImageNet21K with $20\%$ perturbation. By keeping the $p$ percent of images with the lowest outlier scores, a large share of the outliers can be removed. However, a smaller value of $p$ leaves fewer inlier images available for training. In this example, $p=50$ yields lower selectivity than $p=80$ according to the Matthews correlation coefficient (MCC).
  • Figure 3: Improvements from Threshold Sampling. Plotted above are the performance improvements from applying Threshold Sampling (TS) compared with using the inverted inlier scores from ISP. Notice how applying TS improves the robustness of RANSAC-NN, especially in high perturbation settings.
  • Figure 4: Sample Size and Sampling Iterations. Shown above is the performance of RANSAC-NN under different sample sizes (color intensity) and sampling iterations ($x$-axis). Notice how a large sample size requires significantly more sampling iterations to match the performance obtained with smaller sample sizes. However, an extremely small sample size should be avoided (see the Methods discussion of sample size and iteration properties). With a reasonable sample size, more sampling iterations often result in better performance.
  • Figure 5: Influence of Contaminated Training (MobileNet).
  • ...and 4 more figures
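
The filtering trade-off described in the Figure 2 caption can be made concrete with a short sketch. It keeps the $p$ percent of items with the lowest outlier scores and measures the selection with the Matthews correlation coefficient, treating "kept" as the positive prediction and "inlier" as the positive class. The helper names `keep_lowest_scores` and `matthews_corrcoef`, the scores, and the labels below are made up for illustration and are not taken from the paper.

```python
import math


def keep_lowest_scores(outlier_scores, p):
    """Return the indices of the p percent of items with the lowest scores."""
    n_keep = int(len(outlier_scores) * p / 100)
    order = sorted(range(len(outlier_scores)), key=lambda i: outlier_scores[i])
    return set(order[:n_keep])


def matthews_corrcoef(kept, is_inlier):
    """MCC with 'kept' as the predicted positive and 'inlier' as the true positive class."""
    tp = fp = tn = fn = 0
    for i, inlier in enumerate(is_inlier):
        if i in kept:
            if inlier:
                tp += 1
            else:
                fp += 1
        else:
            if inlier:
                fn += 1
            else:
                tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when any marginal count is zero and MCC is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical scores for ten images, two of which (indices 4 and 7) are outliers.
scores = [0.10, 0.20, 0.15, 0.30, 0.90, 0.25, 0.12, 0.80, 0.22, 0.18]
is_inlier = [True, True, True, True, False, True, True, False, True, True]

mcc_80 = matthews_corrcoef(keep_lowest_scores(scores, 80), is_inlier)
mcc_50 = matthews_corrcoef(keep_lowest_scores(scores, 50), is_inlier)
```

With these made-up numbers, keeping 80 percent removes exactly the two outliers, while keeping 50 percent also discards three inliers, so the MCC is lower for the smaller $p$. That mirrors the caption's point that an overly small $p$ costs usable training images.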