D-Feat Occlusions: Diffusion Features for Robustness to Partial Visual Occlusions in Object Recognition

Rupayan Mallick, Sibo Dong, Nataniel Ruiz, Sarah Adel Bargal

TL;DR

This work tackles the challenge of occlusion robustness in object recognition by leveraging a frozen diffusion model to both inpaint occluded regions and extract diffusion-based embedding features. It introduces two augmentation strategies—input-space diffusion inpainting and embedding-space diffusion features—and a real-world occlusion dataset, D-feat, to evaluate performance on Transformer and ConvNet backbones. Empirical results show diffusion-based augmentations outperform traditional baselines across simulated occlusions and substantially improve performance on real occlusions, with diffusion features offering efficiency advantages. The study suggests diffusion-driven augmentation as a practical approach to enhance robustness in high-stakes vision systems, such as autonomous vehicles, under partial visibility.

Abstract

Diffusion models have seen noteworthy applications to visual tasks. This paper aims to make classification models more robust to occlusions in object recognition by proposing a pipeline that uses a frozen diffusion model. Diffusion features have demonstrated success in image generation and image completion while capturing image context. Occlusion can be posed as an image completion problem by deeming the pixels of the occluder to be "missing." We hypothesize that such features can help hallucinate the visual features of an object behind occluding objects, and hence we propose using them to make models more robust to occlusion. We design experiments that include both input-based and feature-based augmentations. Input-based augmentations involve finetuning on images in which the occluder pixels are inpainted, and feature-based augmentations involve augmenting classification features with intermediate diffusion features. We demonstrate that our proposed use of diffusion-based features results in models that are more robust to partial object occlusions for both Transformers and ConvNets on ImageNet with simulated occlusions. We also propose a dataset that contains real-world occlusions and demonstrate that our method is more robust to partial object occlusions on it.
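The feature-based augmentation can be illustrated with a minimal sketch. The abstract only states that classification features are augmented with intermediate diffusion features; the fusion operator is not specified here, so the code below assumes spatial average pooling followed by concatenation. The function name `fuse_features` and the array shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def fuse_features(l_f, l_d):
    """Fuse classifier features with diffusion features.

    ASSUMPTION: fusion is done by concatenation; the source does not
    specify the operator. l_f has shape (batch, C_f). l_d has shape
    (batch, C_d) or (batch, C_d, H, W) for a spatial feature map.
    """
    l_f = np.asarray(l_f, dtype=np.float32)
    l_d = np.asarray(l_d, dtype=np.float32)
    if l_f.shape[0] != l_d.shape[0]:
        raise ValueError("batch sizes of l_f and l_d must match")
    if l_d.ndim == 4:
        # ASSUMPTION: average-pool the spatial map to one vector per image.
        l_d = l_d.mean(axis=(2, 3))
    # Augmented feature l_a: (batch, C_f + C_d)
    return np.concatenate([l_f, l_d], axis=1)


if __name__ == "__main__":
    l_f = np.random.rand(2, 8)         # stand-in for backbone features
    l_d = np.random.rand(2, 4, 3, 3)   # stand-in for U-Net features
    print(fuse_features(l_f, l_d).shape)
```

In this sketch the augmented feature would feed a classification head in place of the backbone feature alone; other fusion choices (addition after projection, attention) are equally plausible readings of the text.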

Paper Structure

This paper contains 9 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: This figure presents the pipeline of our training process for diffusion (a) inpainting augmentation, and (b) feature augmentation. In (a), an input image $x_i$, together with its part annotations $\tilde{x}_i$, a simple class text prompt $t_i$, and a part segmentation mask $m_i$, is fed into an SD inpainting module that generates an inpainted image $\hat{x}_i$. $\hat{x}_i$ (or the baseline image) is fed into the Transformer/ConvNet model to produce model features $l_f$ that are then used for object recognition. In (b), an input image $x_i$, together with its part annotations $\tilde{x}_i$ and a null prompt $t_i$, is fed into a frozen SD model from which U-Net intermediate features $l_d$ are extracted. $l_f$ and $l_d$ are fused into an augmented feature $l_a$ that is then used for object recognition.
  • Figure 2: Original (row 1) shows images from the ImageNet training set. Occluded (row 2) presents the same images with blacked-out image parts mimicking partial object occlusions. Inpainted (row 3) presents the parts generated using a Stable Diffusion pipeline (Rombach et al., 2022). The generated inpainted images are then used for input augmentation.
  • Figure 3: This figure contrasts the two test setups we use. Row 1: Occlusions in the ImageNet validation set (Deng et al., 2009) for a sample $60\%$ occlusion. Row 2: Real-world images from our D-feat dataset, crawled to demonstrate occlusion of objects from particular classes of interest that overlap with the PartImageNet dataset.
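
The simulated occlusions described in the captions above (blacked-out image parts selected by a segmentation mask) can be sketched as follows. This is an illustrative sketch only: the helper names `occlude` and `occlusion_fraction` are hypothetical, and measuring the occlusion level as the fraction of object pixels covered by the mask is an assumption, since the captions do not define how the $60\%$ figure is computed.

```python
import numpy as np


def occlude(image, mask):
    """Return a copy of `image` with the masked pixels set to black.

    image: (H, W) or (H, W, C) array. mask: (H, W) array, truthy where
    the occluder sits (for example, a part segmentation mask).
    """
    image = np.asarray(image)
    mask = np.asarray(mask, dtype=bool)
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must match the image height and width")
    occluded = image.copy()   # leave the input untouched
    occluded[mask] = 0        # black out every channel of masked pixels
    return occluded


def occlusion_fraction(mask, object_mask):
    """ASSUMPTION: occlusion level = covered object pixels / object pixels."""
    mask = np.asarray(mask, dtype=bool)
    object_mask = np.asarray(object_mask, dtype=bool)
    total = int(object_mask.sum())
    if total == 0:
        return 0.0
    return float((mask & object_mask).sum()) / total


if __name__ == "__main__":
    img = np.full((4, 4, 3), 255, dtype=np.uint8)
    m = np.zeros((4, 4), dtype=bool)
    m[:2, :] = True                    # occlude the top half
    obj = np.ones((4, 4), dtype=bool)  # object fills the whole image
    print(occlude(img, m)[:, :, 0])
    print(occlusion_fraction(m, obj))
```

In the input-augmentation setting, the same mask would be handed to an inpainting model so that the blacked-out region is regenerated rather than left black.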