Are Deep Learning Models Robust to Partial Object Occlusion in Visual Recognition Tasks?

Kaleb Kassaw, Francesco Luzi, Leslie M. Collins, Jordan M. Malof

TL;DR

Modern CNN-based models recognize occluded images more accurately than earlier CNN-based models, and ViT-based models are more accurate still, performing only modestly worse than humans.

Abstract

Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under conditions of partial occlusion, i.e., conditions in which objects are partially covered from the view of a camera. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures, including Vision Transformer (ViT) models, have, to some extent, been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, which are typically computer-generated and therefore inexpensive to label. Additionally, methods are rarely compared against each other, and many methods are compared against early, now outdated, deep learning models. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the recently developed Occluded Video Instance Segmentation (OVIS) dataset (arXiv:2102.01558). IRUO utilizes real-world and artificially occluded images to test and benchmark leading methods' robustness to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study using images from IRUO that evaluates human classification performance at multiple levels and types of occlusion. We find that modern CNN-based models show improved recognition accuracy on occluded images compared to earlier CNN-based models, and ViT-based models are more accurate than CNN-based models on occluded images, performing only modestly worse than human accuracy. We also find that certain types of occlusion, including diffuse occlusion, where relevant objects are seen through "holes" in occluders such as fences and leaves, can greatly reduce the accuracy of deep recognition models as compared to humans, especially those with CNN backbones.

Paper Structure (25 sections, 3 equations, 20 figures, 5 tables)


Figures (20)

  • Figure 1: Images from Microsoft's COCO dataset (lin_microsoft_2015), a popular dataset for visual recognition tasks including object detection, demonstrating partial object occlusion. In each image, an object of class "cat" is partially occluded by (a) a suitcase, a class listed in COCO, (b) another cat, and (c) a boot, a class not listed in COCO
  • Figure 2: Images showing various levels of real occlusion, increasing from left to right, in the Image Recognition Under Occlusion (IRUO) dataset. We propose using images such as these to evaluate models’ robustness to occlusion in recognition tasks, owing to their realistic occlusions and wide diversity of scenes. Occlusion levels: (a) 0: no occlusion, (b) 1: some occlusion, in which up to 50 percent of the object is hidden from view, (c) 2: severe occlusion, in which more than 50 percent of the object is hidden from view
  • Figure 3: Diagram showing the data generation process of all dataset partitions included in IRUO. A description of each partition is found in Table \ref{tab:dataset_details}
  • Figure 4: Top row, (a), (b): sample images kept by blur filter algorithm; bottom row, (c), (d): sample images rejected by blur filter algorithm. The blur filter parameters used are a minimum Laplacian variance of 20 and minimum image size of 10,000 pixels
  • Figure 5: Hierarchy of classes in IRUO. In Sec. \ref{sec:human_study} and Fig. \ref{fig:top_humans}, results are reported according to classification level, i.e., the depth of each class in this tree; e.g., the class "zebra" is located at level 4, and the class "fish" is located at level 2. Classes that appear to be subclasses of themselves denote an "other" category; e.g., the subclass "cat" covers objects of the superclass "cat" that are not "tiger" objects
  • ...and 15 more figures
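
The Figure 4 caption names two blur-filter parameters: a minimum Laplacian variance of 20 and a minimum image size of 10,000 pixels. A minimal sketch of such a filter follows. It is not the authors' implementation, and several details are assumptions not stated in the caption: "image size" is taken to mean width times height, the Laplacian is the 4-neighbour kernel evaluated on interior pixels only, and the function names `laplacian_variance` and `keep_image` are hypothetical.

```python
from statistics import pvariance

MIN_LAPLACIAN_VARIANCE = 20.0  # threshold quoted in the Figure 4 caption
MIN_PIXELS = 10_000            # minimum image size quoted in the Figure 4 caption


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian response over interior pixels.

    `gray` is a list of equal-length rows of grayscale intensities.
    Border pixels are skipped (an assumption; other border rules exist).
    """
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def keep_image(gray):
    """Return True if the image passes both the size and the sharpness check."""
    height, width = len(gray), len(gray[0])
    # Assumption: "image size" means the total pixel count.
    if height * width < MIN_PIXELS:
        return False
    # Low variance means few strong edges, which suggests a blurry image.
    return laplacian_variance(gray) >= MIN_LAPLACIAN_VARIANCE
```

A uniform image has zero Laplacian variance and is rejected, while an image with strong edges passes as long as it also meets the size requirement.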