Task-Specific Adaptation with Restricted Model Access
Matan Levy, Rami Ben-Ari, Dvir Samuel, Nir Darshan, Dani Lischinski
TL;DR
This work tackles the challenge of adapting multimodal foundation models to new tasks without exposing their weights or architecture. It introduces two Gray-box frameworks, DarkGray-box Input/Output Adapters (DGA) and LightGray-box (LGA), which attach lightweight adapters to the model's input and output (DGA) or at additional internal entry points (LGA), while keeping the backbone frozen and hidden. Across text-to-image, text-to-video, and sketch-to-image benchmarks, these methods achieve results competitive with full fine-tuning and LoRA while exposing less of the model and updating fewer parameters. The findings suggest that Gray-box adaptation can deliver task-specific performance in a practical, privacy-preserving, and deployment-efficient way, especially in domains close to the backbone's training data. They also highlight limitations in highly distant domains, where some adaptation of internal layers may be necessary.
Abstract
The emergence of foundational models has greatly improved performance across various downstream tasks, with fine-tuning often yielding even better results. However, existing fine-tuning approaches typically require access to model weights and layers, leading to challenges such as managing multiple model copies or inference pipelines, inefficiencies in edge device optimization, and concerns over proprietary rights, privacy, and exposure to unsafe model variants. In this paper, we address these challenges by exploring "Gray-box" fine-tuning approaches, where the model's architecture and weights remain hidden, allowing only gradient propagation. We introduce a novel yet simple and effective framework that adapts to new tasks using two lightweight learnable modules at the model's input and output. Additionally, we present a less restrictive variant that offers more entry points into the model, balancing performance with model exposure. We evaluate our approaches across several backbones on benchmarks such as text-image alignment, text-video alignment, and sketch-image alignment. Results show that our Gray-box approaches are competitive with full-access fine-tuning methods, despite having limited access to the model.
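As a rough illustration of the input/output adapter idea described in the abstract, the following minimal sketch trains two small learnable modules around a frozen model that exposes only a forward pass and gradient propagation. It is not the authors' implementation. Everything in it is an assumption made for illustration: the toy linear FrozenBackbone class, its forward and input_grad methods, the elementwise affine form of the adapters, the squared-error loss, and the synthetic targets (chosen so that the adapters are able to fit them).

```python
import random

DIM = 3

class FrozenBackbone:
    """Hypothetical stand-in for a hidden model. Training code below uses only
    forward() and input_grad(); the weights are never read or updated."""
    def __init__(self, seed=0):
        rng = random.Random(seed)
        self._w = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

    def forward(self, x):
        return [sum(w * xj for w, xj in zip(row, x)) for row in self._w]

    def input_grad(self, grad_out):
        # Gradient propagation: maps d(loss)/d(output) to d(loss)/d(input).
        return [sum(self._w[i][j] * grad_out[i] for i in range(DIM))
                for j in range(DIM)]

def train_adapters(backbone, data, steps=300, lr=0.01):
    # Elementwise affine adapters (an assumed form), initialised to identity.
    a_in, b_in = [1.0] * DIM, [0.0] * DIM
    a_out, b_out = [1.0] * DIM, [0.0] * DIM
    history = []  # mean loss measured before each update
    for _ in range(steps):
        grads = [[0.0] * DIM for _ in range(4)]  # a_in, b_in, a_out, b_out
        loss = 0.0
        for x, target in data:
            x_adapted = [a * xi + b for a, xi, b in zip(a_in, x, b_in)]
            y = backbone.forward(x_adapted)
            y_adapted = [a * yi + b for a, yi, b in zip(a_out, y, b_out)]
            err = [ya - t for ya, t in zip(y_adapted, target)]
            loss += 0.5 * sum(e * e for e in err)
            # Backward pass: output adapter, then through the hidden model.
            g_y = [e * a for e, a in zip(err, a_out)]
            g_x = backbone.input_grad(g_y)
            for k in range(DIM):
                grads[0][k] += g_x[k] * x[k]
                grads[1][k] += g_x[k]
                grads[2][k] += err[k] * y[k]
                grads[3][k] += err[k]
        n = len(data)
        history.append(loss / n)
        # Full-batch gradient step on the adapter parameters only.
        for params, grad in zip((a_in, b_in, a_out, b_out), grads):
            for k in range(DIM):
                params[k] -= lr * grad[k] / n
    return history

rng = random.Random(1)
backbone = FrozenBackbone()
inputs = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(32)]
# Synthetic "new task": targets that an input shift/scale plus an output
# shift/scale around the same backbone can reproduce.
data = [(x, [1.5 * y + 0.5
             for y in backbone.forward([0.8 * xi + 0.2 for xi in x])])
        for x in inputs]
weights_before = [row[:] for row in backbone._w]
history = train_adapters(backbone, data)
print("mean loss before training:", history[0])
print("mean loss at last step:", history[-1])
```

In this toy setting the only trainable parameters are the four adapter vectors, and the training loop never inspects or modifies the backbone, which mirrors the restricted-access setting in spirit. A real backbone would be nonlinear, and its gradient propagation would be provided by an automatic differentiation framework rather than a hand-written method.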
