Multi-Scale Visual Prompting for Lightweight Small-Image Classification

Salim Khazem

TL;DR

This work tackles the challenge of improving small-image classification with lightweight adaptations by introducing Multi-Scale Visual Prompting (MSVP), a pixel-space prompting module that injects global, mid-scale, and local prompt maps into input images. By fusing these prompts with the input before any feature extraction, MSVP serves as a scalable, backbone-agnostic inductive bias that requires negligible parameter overhead. Across MNIST, Fashion-MNIST, and CIFAR-10, MSVP consistently improves performance for CNNs, ResNet-18, and ViT-Tiny, with the largest gains observed on more complex tasks and for transformer backbones. Qualitative analyses and ablations confirm that the multi-scale prompts guide attention and decision boundaries in a meaningful, interpretable way, suggesting broad applicability of pixel-space prompting for small-scale vision problems.

Abstract

Visual prompting has recently emerged as an efficient strategy to adapt vision models using lightweight, learnable parameters injected into the input space. However, prior work mainly targets large Vision Transformers and high-resolution datasets such as ImageNet. In contrast, small-image benchmarks like MNIST, Fashion-MNIST, and CIFAR-10 remain widely used in education, prototyping, and research, yet have received little attention in the context of prompting. In this paper, we introduce Multi-Scale Visual Prompting (MSVP), a simple and generic module that learns a set of global, mid-scale, and local prompt maps fused with the input image via a lightweight $1 \times 1$ convolution. MSVP is backbone-agnostic, increases the parameter count by less than $0.02\%$, and significantly improves performance across CNN and Vision Transformer backbones. We provide a unified benchmark on MNIST, Fashion-MNIST, and CIFAR-10 using a simple CNN, ResNet-18, and a small Vision Transformer. Our method yields consistent improvements with negligible computational overhead. We further include ablations on prompt scales, fusion strategies, and backbone architectures, along with qualitative analyses using prompt visualizations and Grad-CAM. Our results demonstrate that multi-scale prompting provides an effective inductive bias even on low-resolution images.
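
To make the mechanism in the abstract concrete, the following is a minimal forward-pass sketch in NumPy, not the authors' implementation. It assumes prompt grids of $1\times1$, $4\times4$, and $8\times8$ (the scales named in the Figure 1 caption), nearest-neighbour upsampling, channel-wise concatenation of the image with the three upsampled prompts, and a $1 \times 1$ convolution that maps the result back to the input channel count. The class name `MSVPSketch`, the helper `upsample_nearest`, the initialisation, and the requirement that the image size be divisible by each prompt size are all assumptions made for illustration; in a real setup the prompts and fusion weights would be trained jointly with the backbone in an autodiff framework.

```python
import numpy as np


def upsample_nearest(prompt, size):
    """Nearest-neighbour upsample of a (C, h, w) map to (C, size, size).

    Assumes `size` is an integer multiple of h and w (hypothetical restriction).
    """
    _, h, w = prompt.shape
    assert size % h == 0 and size % w == 0
    return prompt.repeat(size // h, axis=1).repeat(size // w, axis=2)


class MSVPSketch:
    """Illustrative pixel-space multi-scale prompt module (forward pass only)."""

    def __init__(self, channels=3, size=32, scales=(1, 4, 8), seed=0):
        rng = np.random.default_rng(seed)
        self.size = size
        # One small prompt map per scale: global, mid-scale, local.
        self.prompts = [0.01 * rng.standard_normal((channels, s, s)) for s in scales]
        # 1x1 convolution over the concatenated [image, prompts] channels.
        in_ch = channels * (1 + len(scales))
        self.weight = np.zeros((channels, in_ch))
        # Assumed init: identity on the image channels, zero on the prompt
        # channels, so the module starts out as a pass-through.
        self.weight[:, :channels] = np.eye(channels)
        self.bias = np.zeros(channels)

    def num_parameters(self):
        return sum(p.size for p in self.prompts) + self.weight.size + self.bias.size

    def __call__(self, x):
        # x has shape (N, C, H, W) with H == W == self.size.
        ups = [np.broadcast_to(upsample_nearest(p, self.size), x.shape)
               for p in self.prompts]
        stacked = np.concatenate([x, *ups], axis=1)  # (N, 4C, H, W)
        # A 1x1 convolution is a per-pixel linear map across channels.
        out = np.einsum("oc,nchw->nohw", self.weight, stacked)
        return out + self.bias[None, :, None, None]


if __name__ == "__main__":
    module = MSVPSketch(channels=3, size=32)
    batch = np.random.default_rng(1).random((2, 3, 32, 32))
    prompted = module(batch)
    print(prompted.shape, module.num_parameters())
```

Because the output has the same shape as the input, a module of this kind can sit in front of any backbone without changing it, which is what makes the approach backbone-agnostic. Under the shapes assumed here the sketch holds 282 parameters for a 3-channel input; the paper's exact count and fusion details may differ.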

Paper Structure

This paper contains 25 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Multi-Scale Visual Prompting architecture. Three learnable prompts at different spatial scales (global $1\times1$, mid $4\times4$, local $8\times8$) are upsampled to input resolution and fused via element-wise addition.
  • Figure 2: Test accuracy comparison on Fashion-MNIST. MS-VP consistently improves performance across all three backbone architectures, with ViT-Tiny showing the largest gain (+0.92%).
  • Figure 3: Confusion matrices for ResNet-18 on Fashion-MNIST. MS-VP (b) reduces confusion between visually similar classes (e.g., shirt vs. T-shirt, pullover vs. coat).
  • Figure 4: Accuracy improvement from MS-VP across datasets and models. Improvements correlate with task complexity: minimal on MNIST (ceiling effect), moderate on Fashion-MNIST.