LR0.FM: Low-Res Benchmark and Improving Robustness for Zero-Shot Classification in Foundation Models
Priyank Pathak, Shyam Marjit, Shruti Vyas, Yogesh S Rawat
TL;DR
This work introduces LR0.FM, a comprehensive benchmark for studying the zero-shot classification robustness of visual-language foundation models on very low-resolution (LR) inputs, across 66 backbones and 15 datasets. It identifies limitations of existing robustness metrics and proposes Weighted Aggregated Robustness (WAR) to provide fairer cross-dataset evaluations, alongside observations that model size and pre-training data quality shape LR robustness. To address LR vulnerability without modifying pre-trained weights, the paper presents LR-TK0, which adds LR tokens to frozen transformers and uses diffusion-based synthetic high-resolution (HR) data with multi-scale distillation to learn LR representations. Empirical results show that LR-TK0 improves LR robustness across backbones with minimal loss in HR performance, offering a practical path to real-world reliability for zero-shot visual-language models. The work thereby provides both a rigorous evaluation framework and a lightweight, generalizable mitigation strategy for visual-language foundation models operating under LR conditions.
Abstract
Visual-language foundation models (FMs) exhibit remarkable zero-shot generalization across diverse tasks, largely attributed to extensive pre-training on large-scale datasets. However, their robustness on low-resolution/pixelated (LR) images, a common challenge in real-world scenarios, remains underexplored. We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on the zero-shot classification performance of 10 FMs across 66 backbones and 15 datasets. We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets. Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality is more important than its size, and (iii) fine-tuned and higher-resolution models are less robust against LR. Our analysis further reveals that the models make semantically reasonable predictions at LR, and that the lack of fine-grained details in the input adversely impacts the models' initial layers more than the deeper layers. We use these insights to introduce a simple strategy, LR-TK0, that enhances the robustness of models without compromising their pre-trained weights. We demonstrate the effectiveness of LR-TK0 for robustness against low resolution across several datasets, and its generalization capability across backbones and other approaches. Code is available at https://github.com/shyammarjit/LR0.FM
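To make the evaluation setting concrete, the following is a minimal sketch of how a low-resolution input might be simulated and how a simple accuracy ratio could be computed. It is an illustration under stated assumptions, not the benchmark's actual pipeline: the function names `degrade_resolution` and `relative_robustness` are hypothetical, the block-averaging and nearest-neighbor resampling are stand-ins for whatever interpolation the benchmark uses, and the plain LR/HR accuracy ratio is a generic baseline measure, not the Weighted Aggregated Robustness metric proposed in the paper.

```python
import numpy as np


def degrade_resolution(image: np.ndarray, low_res: int) -> np.ndarray:
    """Simulate an LR input (hypothetical helper, not from the paper).

    Expects a square image of shape (H, W, C) whose side is divisible by
    low_res. The image is block-averaged down to low_res x low_res and then
    upsampled back to H x W by pixel repetition, so a model still receives
    its usual input size but with the fine detail removed.
    """
    if image.ndim != 3:
        raise ValueError("expected an image of shape (H, W, C)")
    h, w, c = image.shape
    if h != w or h % low_res != 0:
        raise ValueError("expected a square image with side divisible by low_res")
    factor = h // low_res

    # Downsample: average each factor x factor block of pixels.
    blocks = image.reshape(low_res, factor, low_res, factor, c)
    small = blocks.mean(axis=(1, 3))

    # Upsample: repeat each LR pixel factor times along both spatial axes.
    restored = np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
    return restored.astype(image.dtype)


def relative_robustness(acc_lr: float, acc_hr: float) -> float:
    """Fraction of HR accuracy retained at LR (a plain ratio, NOT the paper's WAR)."""
    if acc_hr <= 0:
        raise ValueError("HR accuracy must be positive")
    return acc_lr / acc_hr
```

In use, one would evaluate the same zero-shot classifier once on the original images and once on `degrade_resolution(image, low_res)` for each resolution of interest, then compare the two accuracies per dataset.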
