Fantastic Features and Where to Find Them: A Probing Method to combine Features from Multiple Foundation Models
Benjamin Ramtoula, Pierre-Yves Lajoie, Paul Newman, Daniele De Martini
TL;DR
The paper addresses how to exploit complementary representations from multiple foundation models for downstream vision tasks without retraining large backbones. It introduces ComBo, a probing-based adapter that compresses and fuses multi-layer activations from frozen models via a shared affine projection and a lightweight transformer, with no dataset-specific tuning. A task-relevance mechanism identifies the most informative backbones so that only those are combined. On VTAB-1k, ComBo outperforms prior probing methods and matches or surpasses distillation- and tuning-based approaches at far lower computational cost. The work shows that multi-model integration can be practical and scalable and offers a backbone-selection strategy; noted limitations include feature-map alignment assumptions and reliance on ViT-B/VTAB-1k settings.
Abstract
Foundation models (FMs) trained with different objectives and data learn diverse representations, making some more effective than others for specific downstream tasks. Existing adaptation strategies, such as parameter-efficient fine-tuning, focus on individual models and do not exploit the complementary strengths across models. Probing methods offer a promising alternative by extracting information from frozen models, but current techniques do not scale well with large feature sets and often rely on dataset-specific hyperparameter tuning. We propose Combined backBones (ComBo), a simple and scalable probing-based adapter that effectively integrates features from multiple models and layers. ComBo compresses activations from layers of one or more FMs into compact token-wise representations and processes them with a lightweight transformer for task-specific prediction. Crucially, ComBo does not require dataset-specific tuning or backpropagation through the backbone models. However, not all models are equally relevant for all tasks. To address this, we introduce a mechanism that leverages ComBo's joint multi-backbone probing to efficiently evaluate each backbone's task-relevance, enabling both practical model comparison and improved performance through selective adaptation. On the 19 tasks of the VTAB-1k benchmark, ComBo outperforms previous probing methods, matches or surpasses more expensive alternatives, such as distillation-based model merging, and enables efficient probing of tuned models. Our results demonstrate that ComBo offers a practical and general-purpose framework for combining diverse representations from multiple FMs.
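To make the compression step concrete, the following is a minimal sketch of token-wise compression of multi-layer features with a single affine projection. It is an illustration under stated assumptions, not the authors' implementation: the function names (make_affine, compress_tokens), the choice to concatenate all layers at each token position before projecting, and the random initialisation are hypothetical. The lightweight transformer and the task-relevance mechanism are omitted, and plain Python lists stand in for tensors.

```python
import random


def make_affine(in_dim, out_dim, seed=0):
    """Create a randomly initialised affine map (hypothetical helper).

    Returns (weight, bias), where weight has out_dim rows of length in_dim.
    """
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def affine(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def compress_tokens(layer_activations, weight, bias):
    """Compress multi-layer activations into one compact vector per token.

    layer_activations: list over layers (possibly from several frozen
    backbones); each layer is a list of tokens; each token is a list of
    floats. Assumption: every layer provides the same number of tokens,
    i.e. the feature maps are already aligned.
    """
    num_tokens = len(layer_activations[0])
    if any(len(layer) != num_tokens for layer in layer_activations):
        raise ValueError("all layers must provide the same number of tokens")

    compressed = []
    for t in range(num_tokens):
        # Concatenate the features of token t across all layers and backbones.
        concat = [v for layer in layer_activations for v in layer[t]]
        # The same affine map is applied at every token position.
        compressed.append(affine(concat, weight, bias))
    return compressed


if __name__ == "__main__":
    rng = random.Random(42)
    # Two hypothetical backbones, one layer each: 4 tokens with 8 and 6 channels.
    backbone_a = [[rng.random() for _ in range(8)] for _ in range(4)]
    backbone_b = [[rng.random() for _ in range(6)] for _ in range(4)]
    weight, bias = make_affine(in_dim=8 + 6, out_dim=3)
    tokens = compress_tokens([backbone_a, backbone_b], weight, bias)
    print(len(tokens), len(tokens[0]))
```

In a full probing setup the compressed tokens would be passed to a small trainable transformer with a task-specific head, and only the projection and that transformer would receive gradients while the backbones stay frozen.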
