AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One

Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov

TL;DR

AM-RADIO addresses the challenge of leveraging multiple vision foundation models by distilling their complementary capabilities into a single student encoder. Using a multi-teacher distillation framework with per-teacher adaptor heads and a feature-centric loss, the authors fuse zero-shot language grounding (CLIP), dense spatial features (DINOv2), and open-vocabulary segmentation (SAM) into one model. They introduce E-RADIO, a hybrid CNN-Transformer backbone that achieves substantial throughput gains while preserving accuracy, and demonstrate strong performance across ImageNet, ADE20K, COCO, and LLaVA pipelines. The results show that the unified RADIO models often outperform their teachers, and that E-RADIO delivers the best speed/quality trade-offs while supporting drop-in integration with existing systems. This work offers a practical pathway to compact, versatile backbones that combine the strengths of multiple foundation-model families for broad downstream applicability.

Abstract

A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks. VFMs like CLIP, DINOv2, and SAM are trained with distinct objectives, exhibiting unique characteristics for various downstream tasks. We find that despite their conceptual differences, these models can be effectively merged into a unified model through multi-teacher distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One). This integrative approach not only surpasses the performance of individual teacher models but also amalgamates their distinctive features, such as zero-shot vision-language comprehension, detailed pixel-level understanding, and open-vocabulary segmentation capabilities. In pursuit of the most hardware-efficient backbone, we evaluated numerous architectures in our multi-teacher distillation pipeline using the same training recipe. This led to the development of a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 7x faster than the teacher models. Our comprehensive benchmarking process covers downstream tasks including ImageNet classification, ADE20K semantic segmentation, COCO object detection, and the LLaVA-1.5 framework. Code: https://github.com/NVlabs/RADIO
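
As a rough illustration of multi-teacher distillation with one adaptor head per teacher, the sketch below projects a shared student feature tensor into each teacher's feature space and sums a weighted per-teacher matching loss. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the `LinearAdaptor` class, the cosine-distance loss, the equal teacher weights, and all feature dimensions are hypothetical choices made for illustration, and random arrays stand in for real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)


def cosine_distance(a, b, eps=1e-8):
    """Mean of (1 - cosine similarity), computed over the last axis."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a_n * b_n, axis=-1)))


class LinearAdaptor:
    """Hypothetical per-teacher head: maps student features to a teacher's width."""

    def __init__(self, in_dim, out_dim):
        self.weight = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
        self.bias = np.zeros(out_dim)

    def __call__(self, x):
        return x @ self.weight + self.bias


def multi_teacher_loss(student_feats, teacher_feats, adaptors, weights):
    """Weighted sum of per-teacher feature-matching losses.

    The loss function (cosine distance) is an assumption of this sketch.
    """
    total = 0.0
    for name, target in teacher_feats.items():
        pred = adaptors[name](student_feats)  # project into the teacher's space
        total += weights[name] * cosine_distance(pred, target)
    return total


# Toy setup: feature widths below are made up for illustration.
student_dim = 16
teacher_dims = {"clip": 32, "dinov2": 24, "sam": 8}
adaptors = {n: LinearAdaptor(student_dim, d) for n, d in teacher_dims.items()}
weights = {n: 1.0 for n in teacher_dims}

# Shape: (batch, tokens, channels); random stand-ins for real features.
student_feats = rng.normal(size=(4, 10, student_dim))
teacher_feats = {n: rng.normal(size=(4, 10, d)) for n, d in teacher_dims.items()}

loss = multi_teacher_loss(student_feats, teacher_feats, adaptors, weights)
print(f"toy multi-teacher loss: {loss:.4f}")
```

Keeping the adaptors separate from the student means the shared backbone can be used on its own at inference time, with a teacher-specific head attached only when that teacher's feature space is needed.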

Paper Structure (28 sections, 7 equations, 13 figures, 10 tables)

Figures (13)

  • Figure 1: AM-RADIO is a framework to distill multiple pretrained vision foundation models, such as CLIP [radford2021clip], DINOv2 [oquab2023dinov2], and SAM [kirillov2023sam], into a single model that we call RADIO. As a result, a single vision foundation model agglomerates unique properties of the original models. This unifying approach obtains state-of-the-art feature representations in a single forward pass while also enabling unique properties such as zero-shot classification (CLIP) or open set instance segmentation (SAM) at negligible additional cost. Image description: (left) PCA feature visualization of different models. Our proposed RADIO model can process any resolution and aspect ratio, and produces semantically rich dense encodings; (middle) the overview of the AM-RADIO framework; (right) benchmarks on classification, segmentation, and vision-language modeling tasks (see the results section).
  • Figure 2: AM-RADIO is a multi-teacher distillation framework that efficiently trains new vision foundation models of arbitrary architecture. It unifies the unique attributes of each teacher (such as zero-shot text grounding and dense correspondence) into a single model that even outperforms them on a majority of the tasks.
  • Figure 3: PCA visualization of the position embeddings for various models. The CPE method not only allows RADIO to learn an arbitrarily large absolute position embedding map, but also goes a long way towards regularizing the space and eliminating high frequency artifacts. As seen with the other models, position embeddings normally have regular frequency patterns, leading to undesirable output artifacts from the ViT [yang2024denoising, yang2023emernerf, bolya2023window].
  • Figure 4: All models followed the same training protocol. The results from three benchmarks show that RADIO and E-RADIO models outperform others in efficiency. The under-performance of the other models might be due to their architectures being overfit to supervised ImageNet-1K training. E-RADIO notably delivers results 10 times faster and with a 20% improvement over teacher models. We study E-RADIO at 224px resolution, with a window size of 7.
  • Figure 5: RADIO "mode switches" when resolution is increased. In the plot, we show the mean squared error (MSE) between the RADIO features coming from its DINOv2 head at different resolutions and the features actually produced by DINOv2 at 518px. We bilinearly interpolate the RADIO features to match the DINOv2 feature resolution. At 720px, there is a sudden jump in the error, which corresponds to a complete change in color space in the image.
  • ...and 8 more figures
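
The measurement described in the Figure 5 caption (bilinearly resizing one feature map to another model's feature resolution, then taking the MSE) can be sketched as follows. This is a minimal NumPy sketch, not the authors' evaluation code: the function names `bilinear_resize` and `feature_mse` are hypothetical, the corner-aligned sampling convention is an assumption, and the grid sizes and channel count are made up, with random arrays standing in for real features.

```python
import numpy as np


def bilinear_resize(feats, out_h, out_w):
    """Resize an (H, W, C) feature map with bilinear interpolation.

    Uses corner-aligned sampling, which is an assumption of this sketch.
    """
    in_h, in_w, _ = feats.shape
    ys = np.linspace(0.0, in_h - 1, out_h)  # fractional source rows
    xs = np.linspace(0.0, in_w - 1, out_w)  # fractional source columns
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)  # clamp neighbours at the border
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]  # vertical blend weights
    wx = (xs - x0)[None, :, None]  # horizontal blend weights
    top = feats[y0][:, x0] * (1 - wx) + feats[y0][:, x1] * wx
    bottom = feats[y1][:, x0] * (1 - wx) + feats[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def feature_mse(student_feats, teacher_feats):
    """MSE after resizing the student map to the teacher's spatial grid."""
    out_h, out_w = teacher_feats.shape[:2]
    resized = bilinear_resize(student_feats, out_h, out_w)
    return float(np.mean((resized - teacher_feats) ** 2))


# Toy data: grid sizes and channel count are illustrative, not from the paper.
rng = np.random.default_rng(0)
teacher_map = rng.normal(size=(37, 37, 8))  # fixed-resolution reference
for grid in (28, 37, 45):  # student maps at several resolutions
    student_map = rng.normal(size=(grid, grid, 8))
    print(grid, round(feature_mse(student_map, teacher_map), 4))
```

With real features, plotting this error against the student's input resolution would give the kind of curve the caption describes, where a sudden jump marks the mode switch.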