HyperCLIP: Adapting Vision-Language models with Hypernetworks

Victor Akinwande, Mohammad Sadegh Norouzzadeh, Devin Willmott, Anna Bair, Madan Ravi Ganesh, J. Zico Kolter

TL;DR

HyperCLIP addresses the deployment bottleneck of large vision–language models by using a hypernetwork to dynamically adapt a small image encoder conditioned on text inputs, enabling end-to-end pretraining and efficient zero-shot classification. The approach keeps the text encoder and a compact image encoder, while the hypernetwork outputs a task-specific set of normalization parameters, yielding consistent improvements over SigLIP with minimal throughput overhead. Key contributions include a Transformer-based hypernetwork design, selective adaptation of normalization-layer parameters, and cross-architecture evaluation showing gains on ImageNet and CIFAR benchmarks as well as on distribution-shift robustness and fairness tasks. This work targets practical edge deployment of vision–language models without explicit distillation or specialized hardware.

Abstract

Self-supervised vision-language models trained with contrastive objectives form the basis of current state-of-the-art methods in AI vision tasks. The success of these models is a direct consequence of the huge web-scale datasets used to train them, but they require correspondingly large vision components to properly learn powerful and general representations from such a broad data domain. This poses a challenge for deploying large vision-language models, especially in resource-constrained environments. To address this, we propose an alternative vision-language architecture, called HyperCLIP, that uses a small image encoder along with a hypernetwork that dynamically adapts image encoder weights to each new set of text inputs. All three components of the model (hypernetwork, image encoder, and text encoder) are pre-trained jointly end-to-end. With a trained HyperCLIP model, we can generate new deployment-friendly zero-shot image classifiers for any task with a single forward pass through the text encoder and hypernetwork. HyperCLIP increases the zero-shot accuracy of SigLIP-trained models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100 with minimal training throughput overhead.
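To make the data flow concrete, the following is a minimal sketch of the idea in plain Python, not the authors' implementation. It assumes the hypernetwork only produces the scale and bias of a single normalization layer (as the TL;DR and the Figure 2 caption suggest), and it replaces the Transformer hypernetwork with a mean-pool followed by a linear map. All names (`hypernetwork`, `image_encoder`, `W_scale`, `W_bias`, `W_img`) and all dimensions are hypothetical, and the text embeddings are random placeholders standing in for the output of a text encoder.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def l2_normalize(x):
    """Scale x to unit Euclidean length (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / n for v in x]

def layer_norm(x, scale, bias, eps=1e-6):
    """Standardize x, then apply a per-feature scale and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [s * (v - mean) / math.sqrt(var + eps) + b
            for v, s, b in zip(x, scale, bias)]

def hypernetwork(text_embs, W_scale, W_bias):
    """Toy stand-in for the hypernetwork (assumption: mean-pool + linear map).

    Maps the set of class text embeddings to normalization parameters.
    """
    d = len(text_embs[0])
    pooled = [sum(e[i] for e in text_embs) / len(text_embs) for i in range(d)]
    scale = [1.0 + s for s in matvec(W_scale, pooled)]  # zero output = identity scale
    bias = matvec(W_bias, pooled)
    return scale, bias

def image_encoder(pixels, W_img, scale, bias):
    """Tiny image encoder: one linear layer plus a task-adapted normalization."""
    return l2_normalize(layer_norm(matvec(W_img, pixels), scale, bias))

def zero_shot_classify(pixels, text_embs, W_img, scale, bias):
    """Pick the class whose text embedding is most similar to the image embedding."""
    z = image_encoder(pixels, W_img, scale, bias)
    scores = [sum(a * b for a, b in zip(z, l2_normalize(t))) for t in text_embs]
    return max(range(len(scores)), key=scores.__getitem__), scores

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class_text_embs = rand_matrix(3, 4)   # placeholder embeddings for 3 class prompts
W_scale, W_bias = rand_matrix(4, 4), rand_matrix(4, 4)
W_img = rand_matrix(4, 6)             # 6 "pixels" -> 4-d image embedding

# One pass through the (toy) hypernetwork yields the task-specific parameters.
scale, bias = hypernetwork(class_text_embs, W_scale, W_bias)

image = [rng.uniform(0.0, 1.0) for _ in range(6)]
pred, scores = zero_shot_classify(image, class_text_embs, W_img, scale, bias)
print(pred, scores)
```

In this sketch the normalization parameters depend only on the class prompts, so they are computed once per task and reused for every image; at inference time only the small image encoder runs per input.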

Paper Structure

This paper contains 21 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of HyperCLIP. (Left) The traditional CLIP architecture with SigLIP loss. (Right) The HyperCLIP variant. We use a hypernetwork to generate the weights of a smaller vision model within the SigLIP contrastive pre-training framework. The entire setup is trained end-to-end. HyperCLIP increases the zero-shot accuracy of SigLIP models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100 with minimal training throughput overhead.
  • Figure 2: (Left) Overview of the hypernetwork. We process the text embedding using a transformer and directly output the normalization scale and bias parameters.
  • Figure 3: (Left) Impact of Normalization finetuning: HyperCLIP improves by 1.83% over SigLIP on CIFAR-100, with a 3.34% gap between SigLIP-Probing and SigLIP-Upper Bound.
  • Figure 4: Performance delta when the transformer in HyperCLIP is removed, on a family of EfficientNet models. Metrics are top-1 zero-shot accuracy for classification, top-1 mean recall for retrieval tasks, and worst-group top-1 zero-shot accuracy for fairness tasks.
  • Figure 5: Evolution of parameter norms (left) and update norms (right) over 100 epochs for models trained with CLIP loss, SigLIP loss, and SigLIP loss with a hypernetwork. The CLIP model shows a steady decline in norms, while the SigLIP models demonstrate varying behaviors, with the hypernetwork variant achieving stable updates.
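
Several captions above refer to the SigLIP loss. As a rough illustration, the pairwise sigmoid objective can be sketched as below: every image-text pair in a batch is treated as an independent binary classification, with matching pairs labeled +1 and all other pairs labeled -1. This is a sketch in plain Python under stated assumptions, not code from the paper; the function names and the default `temperature` and `bias` values are illustrative choices, and embeddings are assumed to be L2-normalized already.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(img_embs, txt_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of image and text embeddings.

    Pair (i, j) is a positive when i == j and a negative otherwise.
    The temperature and bias defaults are illustrative assumptions.
    """
    n = len(img_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(img_embs[i], txt_embs[j]))
            logit = temperature * sim + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / n

# Three orthonormal embeddings: perfectly aligned pairs vs. mismatched pairs.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = sigmoid_pairwise_loss(basis, basis)
shuffled = sigmoid_pairwise_loss(basis, basis[1:] + basis[:1])
print(aligned < shuffled)
```

Because each pair contributes its own term, the loss needs no softmax normalization across the batch, which is the main structural difference from the softmax-based CLIP loss.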