Task-Specific Adaptation with Restricted Model Access

Matan Levy, Rami Ben-Ari, Dvir Samuel, Nir Darshan, Dani Lischinski

TL;DR

This work tackles the challenge of adapting multimodal foundation models to new tasks without exposing their weights or architecture. It introduces two Gray-box frameworks: DarkGray-box Input/Output Adapters (DGA), which attach lightweight adapters only at the model's input and output, and LightGray-box (LGA), which additionally allows entry points into intermediate layers. In both, the backbone stays frozen and hidden. Across text-image, text-video, and sketch-image alignment benchmarks, these methods are competitive with full fine-tuning and LoRA while exposing far less of the model and updating far fewer parameters. The findings suggest that Gray-box adaptation can offer practical, privacy-preserving, and deployment-efficient task-specific performance, especially in domains close to the backbone's training data, while domains far from that data may still require some access to internal weights.

Abstract

The emergence of foundational models has greatly improved performance across various downstream tasks, with fine-tuning often yielding even better results. However, existing fine-tuning approaches typically require access to model weights and layers, leading to challenges such as managing multiple model copies or inference pipelines, inefficiencies in edge device optimization, and concerns over proprietary rights, privacy, and exposure to unsafe model variants. In this paper, we address these challenges by exploring "Gray-box" fine-tuning approaches, where the model's architecture and weights remain hidden, allowing only gradient propagation. We introduce a novel yet simple and effective framework that adapts to new tasks using two lightweight learnable modules at the model's input and output. Additionally, we present a less restrictive variant that offers more entry points into the model, balancing performance with model exposure. We evaluate our approaches across several backbones on benchmarks such as text-image alignment, text-video alignment, and sketch-image alignment. Results show that our Gray-box approaches are competitive with full-access fine-tuning methods, despite having limited access to the model.
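The Gray-box setting described above can be illustrated with a toy sketch. This is not the authors' implementation: the class `HiddenBackbone`, the scalar adapters, the learning rate, and the regression target are all hypothetical stand-ins chosen so the example runs with the Python standard library alone. The only point it tries to make is that the input and output adapters can be trained while the backbone exposes nothing but a forward pass and a gradient with respect to its input.

```python
class HiddenBackbone:
    """Hypothetical stand-in for a frozen, hidden model.

    Callers may only run it forward and request the gradient of a loss
    with respect to the backbone's input; the weight is never updated.
    """

    def __init__(self):
        self._w = 1.5  # hidden and frozen (assumed value for the toy)

    def forward(self, x):
        return self._w * x

    def input_grad(self, grad_output):
        # Chain rule through the backbone. This toy backbone is linear, so
        # the result does not depend on the input; a real model's would.
        return grad_output * self._w


def mean_loss(params, backbone, data):
    """Mean squared error of adapter -> backbone -> adapter on the data."""
    a, b, c, d = params
    total = 0.0
    for x, target in data:
        y = c * backbone.forward(a * x + b) + d
        total += (y - target) ** 2
    return total / len(data)


def train_adapters(backbone, data, lr=0.02, epochs=500):
    """Train scalar input adapter (a, b) and output adapter (c, d) by SGD."""
    a, b, c, d = 1.0, 0.0, 1.0, 0.0  # adapters start as the identity
    for _ in range(epochs):
        for x, target in data:
            # Forward: input adapter -> hidden backbone -> output adapter.
            x_adapted = a * x + b
            h = backbone.forward(x_adapted)
            y = c * h + d

            # Backward: the output adapter's gradients are computed locally.
            grad_y = 2.0 * (y - target)
            grad_c, grad_d = grad_y * h, grad_y
            grad_h = grad_y * c

            # The backbone reveals only the gradient at its input.
            grad_x_adapted = backbone.input_grad(grad_h)
            grad_a, grad_b = grad_x_adapted * x, grad_x_adapted

            a -= lr * grad_a
            b -= lr * grad_b
            c -= lr * grad_c
            d -= lr * grad_d
    return a, b, c, d


# Toy "downstream task": fit target = 3x + 1 without touching the backbone.
data = [(x, 3.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
backbone = HiddenBackbone()
initial = mean_loss((1.0, 0.0, 1.0, 0.0), backbone, data)
params = train_adapters(backbone, data)
final = mean_loss(params, backbone, data)
print(f"loss before: {initial:.4f}, after: {final:.6f}")
```

In this sketch the training loop never reads `backbone._w`. It only calls `forward` and `input_grad`, which mirrors the restriction that the backbone permits gradient propagation but nothing more.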

Paper Structure

This paper contains 17 sections, 5 figures, 18 tables.

Figures (5)

  • Figure 1: An overview of our gray-box frameworks. Left: DarkGray-box Input/Output Adapters (DGA) permits modifications only at the input and output levels while keeping the backbone model hidden and frozen. The only information available is the gradient flow (indicated by the orange-dotted arrow), which matches the shape of the last layer of the input adapter. Right: In contrast, LightGray-box (LGA) allows additional entry points into the model's intermediate layers, exposing slightly more information, such as the input dimensionality and the gradients of a subset of the layers.
  • Figure 2: An overview of our Input Adapters. The visual input adapter (left) consists of 2D task-specific convolutional layers that preserve the image's original size. The textual input adapter (right) includes two task-specific tokens: a "shift" token added to the original sequence tokens and an "extra" token appended to the original sequence as a contextual token. Both adapters transform the original input into a new representation that better aligns with the pre-trained backbone model.
  • Figure 3: Images generated by three model versions: Original (zero-shot), LoRA, and LGA.
  • Figure 4: Visualization of the input adapter's influence on images.
  • Figure 5: General schemes for handling $N$ different tasks or domains. Top: A single optimized model designed for multiple tasks or domains. Bottom: A naive approach with $N$ different models, one for each task.
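As a rough illustration of the textual input adapter described in the Figure 2 caption, the sketch below applies a "shift" token and an "extra" token to a short sequence. It rests on assumptions that the caption does not confirm: that the adapter acts on token embeddings, that the shift token is added element-wise to every token in the sequence, and that the extra token is appended at the end. The function name `apply_text_adapter` and the example values are hypothetical, and in practice both tokens would be learned parameters rather than constants.

```python
def apply_text_adapter(token_embeddings, shift_token, extra_token):
    """Return the adapted sequence: shifted tokens followed by the extra token.

    Assumption: "added to the original sequence tokens" is read here as an
    element-wise addition of the same shift vector to each token embedding.
    """
    dim = len(shift_token)
    if len(extra_token) != dim or any(len(tok) != dim for tok in token_embeddings):
        raise ValueError("all vectors must share the embedding dimension")

    # Shift every original token by the task-specific shift vector.
    shifted = [[v + s for v, s in zip(tok, shift_token)] for tok in token_embeddings]

    # Append the task-specific extra token as one additional sequence element.
    return shifted + [list(extra_token)]


# Two tokens with 2-dimensional embeddings (illustrative values only).
tokens = [[1.0, 2.0], [3.0, 4.0]]
adapted = apply_text_adapter(tokens, shift_token=[0.5, -0.5], extra_token=[9.0, 9.0])
print(adapted)  # [[1.5, 1.5], [3.5, 3.5], [9.0, 9.0]]
```

The adapted sequence is one token longer than the original, so under this reading the backbone must accept inputs of length N + 1 for an N-token prompt.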