Resilience of Vision Transformers for Domain Generalisation in the Presence of Out-of-Distribution Noisy Images
Hamza Riaz, Alan F. Smeaton
TL;DR
This paper interrogates the domain generalisation capabilities of vision transformers, focusing on BEIT with masked image modelling. It introduces a scalable synthetic OOD benchmarking framework by overlaying controllable grid occlusions and using zero-shot segmentation (SAM) with Grounding DINO to localise objects, evaluating on PACS, Office-Home, and DomainNet. BEIT demonstrates strong resilience to occlusions and substantially narrows the IID–OOD gap compared with CNNs and other transformers, while attention-distance analysis links DG to global feature reliance. The work also reveals critical failure modes when occlusions distort object shapes and provides practical guidelines for deploying DG-friendly architectures in uncertain real-world settings. These contributions offer a blueprint for robust DG evaluation and insights into how global attention and denoising capabilities support reliable generalisation.
Abstract
Modern AI models excel in controlled settings but often fail in real-world scenarios where data distributions shift unpredictably, a challenge known as domain generalisation (DG). This paper tackles this limitation by rigorously evaluating vision transformers, specifically the BEIT architecture, a model pre-trained with masked image modelling (MIM), against synthetic out-of-distribution (OOD) benchmarks designed to mimic real-world noise and occlusions. We introduce a novel framework that generates OOD test cases by strategically masking object regions in images with grid patterns (25\%, 50\%, 75\% occlusion), using zero-shot segmentation via Segment Anything and Grounding DINO to ensure precise object localisation. Experiments across three benchmarks (PACS, Office-Home, DomainNet) confirm BEIT's known robustness: it maintains 94\% accuracy on PACS and 87\% on Office-Home despite significant occlusions, outperforming CNNs and other vision transformers by margins of up to 37\%. Analysis of self-attention distances reveals that BEIT's dependence on global features correlates with its resilience. Furthermore, our synthetic benchmarks expose critical failure modes: performance degrades sharply when occlusions disrupt object shapes (e.g. a 68\% drop for external grid masking vs. 22\% for internal masking). This work provides two key advances: (1) a scalable method to generate OOD benchmarks using controllable noise, and (2) empirical evidence that MIM and the self-attention mechanism in vision transformers enhance DG by learning invariant features. These insights bridge the gap between lab-trained models and real-world deployment, offering a blueprint for building AI systems that generalise reliably under uncertainty.
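As an illustration of the grid-occlusion idea described above, the following is a minimal sketch and not the authors' implementation. It assumes the object mask is already available as a boolean array (in the paper it would come from the Segment Anything and Grounding DINO pipeline). The function name `grid_occlude`, the cell size, the fill value, and the specific 2x2 cell pattern used to reach 25%, 50%, and 75% coverage are all hypothetical choices made for this example.

```python
import numpy as np


def grid_occlude(image, object_mask, level=0.5, cell=8, fill=0):
    """Hide a fraction of grid cells inside the object region.

    image:       (H, W, C) array.
    object_mask: (H, W) boolean array marking the object (assumed given).
    level:       fraction of grid cells to hide; 0.25, 0.5 or 0.75.
    cell:        side length of one grid cell in pixels (assumed value).
    fill:        pixel value written into hidden cells (assumed value).
    """
    if level not in (0.25, 0.5, 0.75):
        raise ValueError("level must be 0.25, 0.5 or 0.75")

    h, w = object_mask.shape
    rows, cols = np.indices((h, w))

    # Each grid cell gets a slot 0..3 according to its position in a
    # repeating 2x2 arrangement of cells.
    slot = ((rows // cell) % 2) * 2 + ((cols // cell) % 2)

    # Hide 1, 2 or 3 of the 4 slots. Slots 0 and 3 lie on a diagonal,
    # so the 50% level is a checkerboard.
    order = [0, 3, 1, 2]
    hidden_slots = order[: int(level * 4)]
    grid = np.isin(slot, hidden_slots)

    # Only pixels that are both on a hidden cell and on the object change.
    occluded = image.copy()
    occluded[grid & object_mask] = fill
    return occluded


if __name__ == "__main__":
    img = np.full((32, 32, 3), 255, dtype=np.uint8)
    mask = np.zeros((32, 32), dtype=bool)
    mask[8:24, 8:24] = True  # toy square "object"
    for lvl in (0.25, 0.5, 0.75):
        out = grid_occlude(img, mask, level=lvl)
        hidden = (out[..., 0] == 0).sum() / mask.sum()
        print(f"level={lvl}: fraction of object hidden = {hidden}")
```

Because the grid is intersected with the object mask, pixels outside the object are left untouched. For irregular masks the hidden fraction of the object only approximates the nominal level, since cells are cut by the object boundary.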
