LR0.FM: Low-Res Benchmark and Improving Robustness for Zero-Shot Classification in Foundation Models

Priyank Pathak, Shyam Marjit, Shruti Vyas, Yogesh S Rawat

TL;DR

This work introduces LR0.FM, a comprehensive benchmark for studying the zero-shot classification robustness of visual-language foundation models on very low-resolution (LR) inputs, across 66 backbones and 15 datasets. It identifies limitations of existing robustness metrics and proposes Weighted Aggregated Robustness (WAR) to provide fairer cross-dataset evaluations, alongside observations that model size and pre-training data quality shape LR robustness. To address LR vulnerability without retraining, the paper presents LR-TK0, which adds LR tokens to frozen transformers and uses diffusion-based synthetic HR data with multi-scale distillation to learn LR representations. Empirical results show that LR-TK0 improves LR robustness across backbones with minimal loss in HR performance. The work thereby offers both an evaluation framework and a lightweight, generalizable mitigation strategy for visual-language foundation models operating under LR conditions.

Abstract

Visual-language foundation models (FMs) exhibit remarkable zero-shot generalization across diverse tasks, largely attributed to extensive pre-training on large-scale datasets. However, their robustness on low-resolution/pixelated (LR) images, a common challenge in real-world scenarios, remains underexplored. We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on the zero-shot classification performance of 10 FMs across 66 backbones and 15 datasets. We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets. Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality is more important than its size, and (iii) fine-tuned and higher-resolution models are less robust against LR. Our analysis further reveals that the model makes semantically reasonable predictions at LR, and that the lack of fine-grained details in the input adversely impacts the model's initial layers more than the deeper layers. We use these insights to introduce a simple strategy, LR-TK0, that enhances the robustness of models without compromising their pre-trained weights. We demonstrate the effectiveness of LR-TK0 for robustness against low resolution across several datasets, and its generalization capability across backbones and other approaches. Code is available at https://github.com/shyammarjit/LR0.FM

Paper Structure

This paper contains 25 sections, 3 equations, 29 figures, 14 tables.

Figures (29)

  • Figure 1: Top-1 zero-shot classification accuracy (y-axis) vs resolution (x-axis): Backbones for foundation models are merged as shade, with average performance across backbones in the dark.
  • Figure 2: Zero-Shot misclassifications: EVA-CLIP [sun2023eva] correct classification at $224\!\times\!224$ (green) & misclassification at lower resolution (red). However, ImageNet labels-based mispredictions are semantically reasonable (humans), indicating viability of pre-trained weights at low resolution.
  • Figure 3: Left: Dataset: Size $\propto\log$ # test images, and color gradient $\propto$ # of test classes (orange is 10 & black is 1000 classes). Right: Zero-Shot Evaluation: a Food-101 image ($32\!\times\!32$) generates image embeddings $f_{Img}$, while class labels are filled into templates (1 shown), generating text embeddings (averaged across templates). The dot product of $f_{Img}$ with the text features gives the classification logits.
  • Figure 4: Left: Improved $\Gamma^{D}_{n}$ vs traditional $\gamma^{D}_{n}$: $\Gamma^{D}_{n}\approx\gamma^{D}_{n}$ except near random predictions ($\mathcal{E}_D\rightarrow 0$). Mid: Correlation between the ordering of models after averaging robustness (SAR) across datasets ($\gamma^{D}_{16}$ & $\Gamma^{D}_{16}$) and each dataset's true ordering. The final SAR ranking ignores datasets like EuroSAT (0.26). Right: Optimized dataset weights for WAR-16. The supplementary contains the numeric values.
  • Figure 5: Evaluations at $16\!\times\!16$. Left: SAR vs WAR: WAR improves the correlation (between the ordering of models after aggregation and the ordering on individual datasets) for EuroSAT ($0.26\rightarrow 0.49$) and ImageNet-A ($0.56\rightarrow 0.68$), both computed via $\Gamma^{D}_{16}$. Right: (i) Model size & (ii) pre-training dataset size positively impact robustness. (i) Dot size $\propto$ GFLOPs, no impact on robustness. (ii) Dot size $\propto$ model size, positively impacts robustness. ResNets ($\star$) and transformers ($\bigcirc$).
  • ...and 24 more figures
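
The zero-shot evaluation described in the Figure 3 caption (template-averaged text embeddings, then a dot product with the image embedding $f_{Img}$) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the toy 3-dimensional embeddings, the class names, and the helper names (`l2_normalize`, `class_text_embedding`, `zero_shot_logits`) are hypothetical, and the L2 normalization step is an assumption borrowed from common CLIP-style practice rather than something the caption states.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed step, common in CLIP-style models)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_text_embedding(template_embeddings):
    """Average the per-template text embeddings of one class, then re-normalize."""
    normalized = [l2_normalize(e) for e in template_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def zero_shot_logits(f_img, class_embeddings):
    """Dot product of the image embedding with each class text embedding."""
    f_img = l2_normalize(f_img)
    return [sum(a * b for a, b in zip(f_img, t)) for t in class_embeddings]


# Hypothetical toy data: two classes, two prompt templates each, 3-d embeddings.
# In a real pipeline these would come from the text and image encoders.
text_embeddings = {
    "pizza": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "sushi": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
class_names = list(text_embeddings)
class_embeddings = [class_text_embedding(text_embeddings[c]) for c in class_names]

# Hypothetical embedding of a low-resolution image.
f_img = [0.8, 0.3, 0.05]

logits = zero_shot_logits(f_img, class_embeddings)
predicted = class_names[max(range(len(logits)), key=logits.__getitem__)]
print(predicted, logits)
```

Averaging over templates before the dot product means each class contributes a single text vector, so the cost of classifying an image does not grow with the number of prompt templates.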