Model Reprogramming Outperforms Fine-tuning on Out-of-distribution Data in Text-Image Encoders

Andrew Geng; Pin-Yu Chen

TL;DR

This work addresses the problem that standard fine-tuning of CLIP-like text-image encoders can degrade out-of-distribution (OOD) generalization and OOD detection. It introduces Reprogrammer, a lightweight input-transformation approach that reuses pre-trained parameters, and Residual Reprogrammer, which adds a residual connection to better preserve pre-training representations. Empirical results on CIFAR-10 and ImageNet-1k show that Reprogrammer methods consistently outperform traditional fine-tuning across ID, OOD generalization, and OOD detection, with Residual Reprogrammer achieving the strongest holistic gains. The study highlights the importance of maintaining pre-training representations for robust downstream performance and suggests reprogramming as a practical, efficient alternative for multi-modal text-image encoders.

Abstract

When evaluating the performance of a pre-trained model transferred to a downstream task, it is imperative to assess not only the in-distribution (ID) accuracy of the downstream model but also its capacity to generalize and identify out-of-distribution (OOD) samples. In this paper, we unveil the hidden costs associated with intrusive fine-tuning techniques. Specifically, we demonstrate that commonly used fine-tuning methods not only distort the representations necessary for generalizing to covariate-shifted OOD samples (OOD generalization) but also distort the representations necessary for detecting semantically-shifted OOD samples (OOD detection). To address these challenges, we introduce a new model reprogramming approach for fine-tuning, which we name Reprogrammer. Reprogrammer aims to improve the holistic performance of the downstream model across ID, OOD generalization, and OOD detection tasks. Our empirical evidence reveals that Reprogrammer is less intrusive and yields superior downstream models. Furthermore, we demonstrate that by appending an additional representation residual connection to Reprogrammer, we can further preserve pre-training representations, resulting in an even safer and more robust downstream model capable of excelling in many ID classification, OOD generalization, and OOD detection settings.
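
To make the residual connection mentioned above concrete, here is a minimal NumPy sketch of how reprogrammed and zero-shot features might be blended before cosine-similarity classification. The text here only says that the two representations are combined; the convex combination, the weight `alpha`, and the function names `l2_normalize`, `residual_combine`, and `predict` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarities."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def residual_combine(reprogrammed, zero_shot, alpha=0.5):
    """Blend reprogrammed and zero-shot features into one unit-length feature.

    ASSUMPTION: a convex combination with weight `alpha`. The source only
    states that the two representations are combined through a residual
    connection; it does not give the formula used here.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = alpha * l2_normalize(reprogrammed) + (1.0 - alpha) * l2_normalize(zero_shot)
    return l2_normalize(mixed)


def predict(image_feats, text_feats):
    """Return, per image, the index of the class text feature with the
    highest cosine similarity, together with the full similarity matrix."""
    sims = l2_normalize(image_feats) @ l2_normalize(text_feats).T
    return sims.argmax(axis=1), sims
```

Under this assumption, `alpha=1.0` recovers the purely reprogrammed features and `alpha=0.0` recovers the zero-shot features, so the weight controls how much of the pre-training representation is kept at inference time.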

Paper Structure (53 sections, 8 equations, 4 figures, 11 tables)

This paper contains 53 sections, 8 equations, 4 figures, 11 tables.

Introduction
Background and Related Work
Pre-trained and CLIP-like Models:
Out-of-distribution Generalization:
Out-of-distribution Detection:
Model Reprogramming:
Methodology
Image Reprogramming
Text Reprogramming
Reprogrammer
Residual Reprogrammer
Experiments
Experimental Setup
In-distribution dataset:
Out-of-distribution Generalization:
...and 38 more sections

Figures (4)

Figure 1: Radar charts illustrating the trade-offs between ID, OOD generalization, and OOD detection performances across linear-probing, full fine-tuning, reprogrammer, and residual reprogrammer. All results are based on the CIFAR benchmarks. To quantify the cost-performance trade-offs, we report the average scores normalized across all metrics.
Figure 2: Visual diagrams illustrating the image reprogramming and text reprogramming functions. In the image reprogramming function, an input image undergoes resizing and padding, followed by the addition of a learnable edge perturbation. Similarly, in the text reprogramming function, an input caption is tokenized before a lookup table and bias embedding are applied. Subsequently, both the reprogrammed image and caption embeddings are passed through the fixed text-image encoder during a model forward pass.
Figure 3: Visual diagram illustrating the reprogrammer and residual reprogrammer training schema based on the CLIP joint image and text encoder setting. During reprogrammer training, the image and the caption of each pair independently undergo their respective reprogramming functions before being passed into the CLIP image and text encoders. A loss is then computed based on the cosine similarity of the two reprogrammed features, and it is backpropagated to optimize the parameters of the image and text reprogramming functions. At inference time, residual reprogrammer uses a residual connection that combines the reprogrammed and zero-shot representations.
Figure 4: Ablation Studies. The CIFAR and ImageNet padding-ablation panels illustrate the effectiveness of our reprogrammer method as we adjust the image reprogramming padding size. A larger padding size indicates that more of the input image is being subjected to the reprogramming function. Additionally, the two embedding panels present UMAP visualizations comparing the feature spaces of the linear-probed and reprogrammer models using 500 randomly sampled covariate-shifted (CIFAR-10.1) images.
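
The image and text reprogramming functions described in the Figure 2 caption can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the nearest-neighbour resize, the zero-valued padding, the array shapes, and the names `reprogram_image`, `reprogram_text`, `canvas_size`, `pad`, `perturbation`, `lookup_table`, and `bias` are all hypothetical.

```python
import numpy as np


def reprogram_image(image, canvas_size, pad, perturbation):
    """Resize `image` (H, W, C) to the inner window, pad it to a square
    canvas, and add `perturbation` on the padded border only."""
    inner = canvas_size - 2 * pad
    if inner <= 0:
        raise ValueError("pad is too large for the canvas")
    h, w, c = image.shape
    # Nearest-neighbour resize (ASSUMPTION: the interpolation is not
    # specified in the caption; any resize would fit the description).
    rows = np.arange(inner) * h // inner
    cols = np.arange(inner) * w // inner
    resized = image[rows][:, cols]
    # Zero padding around the resized image (ASSUMPTION).
    canvas = np.zeros((canvas_size, canvas_size, c), dtype=np.float64)
    canvas[pad:pad + inner, pad:pad + inner] = resized
    # Mask is 1 on the border and 0 over the resized image, so the
    # perturbation only touches the edges.
    mask = np.ones((canvas_size, canvas_size, 1))
    mask[pad:pad + inner, pad:pad + inner] = 0.0
    return canvas + mask * perturbation


def reprogram_text(token_ids, lookup_table, bias):
    """Map token ids to embeddings with a lookup table, then add a bias.

    ASSUMPTION: `lookup_table` has shape (vocab, dim) and `bias` has
    shape (dim,); both would be learnable in training.
    """
    return lookup_table[token_ids] + bias
```

In training, `perturbation`, `lookup_table`, and `bias` would be the only optimized parameters, while the text-image encoder that consumes the two outputs stays fixed. A larger `pad` shrinks the inner window and gives the perturbation more of the canvas, which matches the padding-size ablation described for Figure 4.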
