Fantastic Features and Where to Find Them: A Probing Method to combine Features from Multiple Foundation Models

Benjamin Ramtoula, Pierre-Yves Lajoie, Paul Newman, Daniele De Martini

TL;DR

The paper tackles how to harness complementary representations from multiple foundation models for downstream vision tasks without retraining large backbones. It introduces ComBo, a probing-based adapter that compresses and fuses multi-layer activations from frozen models via a shared affine projection and a lightweight transformer, eliminating dataset-specific tuning. A task-relevance mechanism selects the most informative backbones, enabling efficient model combination; across VTAB-1k, ComBo outperforms prior probing methods and matches or surpasses distillation- and tuning-based approaches with far lower computational cost. The work demonstrates practical, scalable multi-model integration and offers a principled backbone-selection strategy, though it notes limitations such as feature-map alignment assumptions and reliance on ViT-B/VTAB-1k settings.

Abstract

Foundation models (FMs) trained with different objectives and data learn diverse representations, making some more effective than others for specific downstream tasks. Existing adaptation strategies, such as parameter-efficient fine-tuning, focus on individual models and do not exploit the complementary strengths across models. Probing methods offer a promising alternative by extracting information from frozen models, but current techniques do not scale well with large feature sets and often rely on dataset-specific hyperparameter tuning. We propose Combined backBones (ComBo), a simple and scalable probing-based adapter that effectively integrates features from multiple models and layers. ComBo compresses activations from layers of one or more FMs into compact token-wise representations and processes them with a lightweight transformer for task-specific prediction. Crucially, ComBo does not require dataset-specific tuning or backpropagation through the backbone models. However, not all models are equally relevant for all tasks. To address this, we introduce a mechanism that leverages ComBo's joint multi-backbone probing to efficiently evaluate each backbone's task-relevance, enabling both practical model comparison and improved performance through selective adaptation. On the 19 tasks of the VTAB-1k benchmark, ComBo outperforms previous probing methods, matches or surpasses more expensive alternatives, such as distillation-based model merging, and enables efficient probing of tuned models. Our results demonstrate that ComBo offers a practical and general-purpose framework for combining diverse representations from multiple FMs.
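The pipeline described above (compress multi-layer activations token-wise, then run a lightweight transformer and a classification head) can be sketched roughly as follows. This is a minimal forward-only illustration under stated assumptions, not the authors' implementation: the class name `ComboSketch`, all dimensions, the random initialisation, and the use of a single self-attention layer in place of the lightweight transformer are hypothetical. It also assumes every feature map has the same number of tokens, so that features can be concatenated per token.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ComboSketch:
    """Hypothetical forward-only sketch of a ComBo-style probe (not the paper's code)."""

    def __init__(self, feat_dims, d_model=32, n_classes=10, seed=0):
        rng = np.random.default_rng(seed)
        d_in = sum(feat_dims)  # width after concatenating all layers and models
        self.d_model = d_model
        # Shared affine projection, applied identically to every token.
        self.w_proj = rng.normal(0.0, d_in ** -0.5, (d_in, d_model))
        self.b_proj = np.zeros(d_model)
        # Classification token prepended to the compressed sequence.
        self.cls = rng.normal(0.0, 0.02, (1, 1, d_model))
        # One single-head self-attention layer stands in for the small transformer.
        self.w_q, self.w_k, self.w_v = (
            rng.normal(0.0, d_model ** -0.5, (d_model, d_model)) for _ in range(3)
        )
        # Linear classification head on the output CLS token.
        self.w_head = rng.normal(0.0, d_model ** -0.5, (d_model, n_classes))
        self.b_head = np.zeros(n_classes)

    def forward(self, feature_maps):
        # feature_maps: list of (batch, tokens, dim_i) arrays from frozen backbones,
        # one per (model, layer) pair; token counts are assumed to match.
        x = np.concatenate(feature_maps, axis=-1)       # (B, T, sum of dims)
        z = x @ self.w_proj + self.b_proj               # (B, T, d_model)
        cls = np.broadcast_to(self.cls, (z.shape[0], 1, self.d_model))
        tokens = np.concatenate([cls, z], axis=1)       # (B, T + 1, d_model)
        q, k, v = tokens @ self.w_q, tokens @ self.w_k, tokens @ self.w_v
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(self.d_model))
        tokens = tokens + attn @ v                      # residual attention update
        return tokens[:, 0] @ self.w_head + self.b_head  # logits from the CLS token


# Stand-in activations: two hypothetical backbones, two layers each.
rng = np.random.default_rng(1)
feature_maps = [rng.normal(size=(4, 16, dim)) for dim in (8, 8, 12, 12)]
probe = ComboSketch(feat_dims=[8, 8, 12, 12])
logits = probe.forward(feature_maps)
print(logits.shape)
```

Only the projection, the attention weights, and the head would be trained in such a setup; the backbone activations enter as fixed inputs, which is why no gradient needs to flow through the foundation models.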

Paper Structure

This paper contains 49 sections, 2 equations, 7 figures, 15 tables.

Figures (7)

  • Figure 1: (left) Different FMs learn diverse representations, and the most capable pre-trained model can differ from one downstream task to another. Not only can the model vary, but the location of the layer containing the most directly relevant features can also change, with intermediate layers occasionally leading to the highest probing accuracy. (right) We propose an adapter that can take advantage of these diverse representations by probing layer outputs from multiple frozen models. This allows us to efficiently adapt to diverse downstream tasks without needing to backpropagate through potentially large FMs.
  • Figure 2: Images that maximise activations of different neurons of different models [radfordLearningTransferableVisual2021b, kirillov2023segany, MaskedAutoencoders2021, steiner2022how, darcet2023vitneedreg], using the technique from Ghiasi et al. [ghiasiWhatVisionTransformers2022]. All models rely on a ViT-B architecture, but each was trained with different data and supervision. Although all models usually pick up low-level patterns in early layers, distinct patterns appear as we get to intermediate and later layers, highlighting differences in learned representations. Models trained with ImageNet-21K, MAE, or SAM appear more sensitive to specific textures in later layers, whereas late CLIP neurons appear to react to patterns of semantic entities, and DINOv2 late neurons appear more sensitive to complex abstract shapes. These observations align with the idea that diverse features can be found by using multiple models, but also by using multiple layer outputs. Images are generated from randomly sampled neurons.
  • Figure 3: The adapter. Given intermediate feature maps from multiple models, we first learn a small projection $\Lambda$ of their combined layers' embeddings, which we apply to all their tokens. These tokens are then passed to a small transformer model $\mathcal{F}$, which outputs a CLS token on which we place our classification head.
  • Figure 4: Additional visualisations extending Figure 2. These images are optimised to maximise activations of different neurons across different layers using the technique from Ghiasi et al. [ghiasiWhatVisionTransformers2022]. We also include visualisations for a ViT-B with randomly initialised weights. All images are generated from randomly sampled neurons.
  • Figure 5: Results of linear probing as presented in Figure 1 for all VTAB-1k [zhaiLargescaleStudyRepresentation2020a] datasets. The plots present validation accuracies of linear probing on activations from different layers of ViT-B models trained with CLIP [radfordLearningTransferableVisual2021b], DINOv2 [darcet2023vitneedreg], ImageNet-21K [steiner2022how], MAE [MaskedAutoencoders2021], and SAM [kirillov2023segany], as well as a randomly initialised ViT-B.
  • ...and 2 more figures
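
The layer-wise linear probing referred to in the captions of Figures 1 and 5 can be illustrated with a small sketch: fit one linear classifier per layer on frozen activations and compare validation accuracy. Everything here is an assumption made for illustration. The data is synthetic, the layer names and function names are hypothetical, and a closed-form ridge least-squares fit to one-hot labels is swapped in as the linear probe; the paper's actual probe training setup is not given in this text.

```python
import numpy as np


def fit_linear_probe(x_train, y_train, n_classes, l2=1e-2):
    # Ridge least-squares fit to one-hot targets (a stand-in linear probe).
    xb = np.hstack([x_train, np.ones((len(x_train), 1))])  # append bias column
    targets = np.eye(n_classes)[y_train]
    gram = xb.T @ xb + l2 * np.eye(xb.shape[1])
    return np.linalg.solve(gram, xb.T @ targets)


def probe_accuracy(weights, x, y):
    # Fraction of samples whose highest-scoring class matches the label.
    xb = np.hstack([x, np.ones((len(x), 1))])
    return float(((xb @ weights).argmax(axis=1) == y).mean())


def probe_layers(train_feats, y_train, val_feats, y_val, n_classes):
    # train_feats / val_feats: dict mapping a layer name to (samples, dim) features.
    return {
        name: probe_accuracy(
            fit_linear_probe(train_feats[name], y_train, n_classes),
            val_feats[name],
            y_val,
        )
        for name in train_feats
    }


# Synthetic stand-in for frozen activations: one layer carries class
# information, the other is pure noise.
rng = np.random.default_rng(0)
n_classes, dim = 3, 8
y_train = rng.integers(0, n_classes, 300)
y_val = rng.integers(0, n_classes, 150)
class_means = rng.normal(0.0, 3.0, (n_classes, dim))
train_feats = {
    "informative_layer": class_means[y_train] + rng.normal(size=(300, dim)),
    "uninformative_layer": rng.normal(size=(300, dim)),
}
val_feats = {
    "informative_layer": class_means[y_val] + rng.normal(size=(150, dim)),
    "uninformative_layer": rng.normal(size=(150, dim)),
}
accuracies = probe_layers(train_feats, y_train, val_feats, y_val, n_classes)
best_layer = max(accuracies, key=accuracies.get)
print(best_layer, accuracies)
```

With real backbones, each dictionary entry would hold pooled activations from one layer of one frozen model, and the per-layer accuracies would give a curve like those described for Figure 5.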