Resilience of Vision Transformers for Domain Generalisation in the Presence of Out-of-Distribution Noisy Images

Hamza Riaz, Alan F. Smeaton

TL;DR

This paper interrogates the domain generalisation capabilities of vision transformers, focusing on BEIT with masked image modelling. It introduces a scalable synthetic OOD benchmarking framework by overlaying controllable grid occlusions and using zero-shot segmentation (SAM) with Grounding DINO to localise objects, evaluating on PACS, Office-Home, and DomainNet. BEIT demonstrates strong resilience to occlusions and substantially narrows the IID–OOD gap compared with CNNs and other transformers, while attention-distance analysis links DG to global feature reliance. The work also reveals critical failure modes when occlusions distort object shapes and provides practical guidelines for deploying DG-friendly architectures in uncertain real-world settings. These contributions offer a blueprint for robust DG evaluation and insights into how global attention and denoising capabilities support reliable generalisation.

Abstract

Modern AI models excel in controlled settings but often fail in real-world scenarios where data distributions shift unpredictably, a challenge known as domain generalisation (DG). This paper tackles this limitation by rigorously evaluating vision transformers, specifically the BEIT architecture, a model pre-trained with masked image modelling (MIM), against synthetic out-of-distribution (OOD) benchmarks designed to mimic real-world noise and occlusions. We introduce a novel framework that generates OOD test cases by masking object regions in images with grid patterns (25%, 50%, 75% occlusion), using zero-shot segmentation via Segment Anything and Grounding DINO to ensure precise object localisation. Experiments across three benchmarks (PACS, Office-Home, DomainNet) confirm BEIT's known robustness: it maintains 94% accuracy on PACS and 87% on Office-Home despite significant occlusions, outperforming CNNs and other vision transformers by margins of up to 37%. Analysis of self-attention distances reveals that BEIT's dependence on global features correlates with its resilience. Furthermore, our synthetic benchmarks expose critical failure modes: performance degrades sharply when occlusions disrupt object shapes, e.g. a 68% drop for external grid masking vs. 22% for internal masking. This work provides two key advances: (1) a scalable method to generate OOD benchmarks using controllable noise, and (2) empirical evidence that MIM and the self-attention mechanism in vision transformers enhance DG by learning invariant features. These insights bridge the gap between lab-trained models and real-world deployment and offer a blueprint for building AI systems that generalise reliably under uncertainty.
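The grid occlusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `grid_pattern` and `occlude_object`, the cell size, the black fill value, and the choice of which cells to hide in each 2x2 block of grid cells are all assumptions made for illustration. In the paper the object mask comes from Segment Anything with Grounding DINO; here a hand-made rectangle stands in for it.

```python
import numpy as np

# Cells hidden inside every 2x2 block of grid cells, keyed by occlusion ratio.
# ASSUMPTION: this layout is illustrative, not the paper's exact pattern.
_PATTERNS = {
    0.25: [(0, 0)],                  # one cell in four
    0.50: [(0, 0), (1, 1)],          # checkerboard
    0.75: [(0, 0), (1, 1), (0, 1)],  # three cells in four
}

def grid_pattern(height, width, cell, ratio):
    """Return a boolean (height, width) array that is True where the grid occludes."""
    rows = (np.arange(height) // cell) % 2   # parity of each pixel's cell row
    cols = (np.arange(width) // cell) % 2    # parity of each pixel's cell column
    pattern = np.zeros((height, width), dtype=bool)
    for r, c in _PATTERNS[ratio]:
        pattern |= (rows[:, None] == r) & (cols[None, :] == c)
    return pattern

def occlude_object(image, object_mask, cell=8, ratio=0.5, fill=0):
    """Overwrite grid cells with `fill`, but only inside the object mask."""
    height, width = image.shape[:2]
    occluded = grid_pattern(height, width, cell, ratio) & object_mask
    out = image.copy()               # leave the input image untouched
    out[occluded] = fill
    return out, occluded

# Usage: a white 32x32 RGB image with a hypothetical 16x16 object in the centre.
image = np.full((32, 32, 3), 255, dtype=np.uint8)
object_mask = np.zeros((32, 32), dtype=bool)
object_mask[8:24, 8:24] = True       # stand-in for a segmentation mask
noisy, occluded = occlude_object(image, object_mask, cell=8, ratio=0.75)
print(occluded.sum() / object_mask.sum())  # fraction of the object that is hidden
```

Restricting the grid to the object mask keeps the background intact, so the only change between the original and the generated image is how much of the object remains visible.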

Paper Structure

This paper contains 29 sections, 10 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Visual illustration of selected OOD benchmarks, different versions of ImageNet, to test the robustness of models.
  • Figure 2: Generated data distributions for PACS with 12 variations to measure the resilience of the BEIT model. The y-axis shows types of newly generated distributions based on the number of grids and the x-axis shows different occlusion ratios.
  • Figure 3: Generated data distributions for Office-Home with 12 variations to measure the resilience of the BEIT model. Here 25% denotes simple grids, 50% a checkerboard pattern, and 75% a checkerboard with overlapping units.
  • Figure 4: Generated data distributions for DomainNet with 12 variations to measure the resilience of the BEIT model.
  • Figure 5: Representation of accuracy and loss
  • ...and 6 more figures