Efficient Universal Perception Encoder

Chenchen Zhu, Saksham Suri, Cijo Jose, Maxime Oquab, Marc Szafraniec, Wei Wen, Yunyang Xiong, Patrick Labatut, Piotr Bojanowski, Raghuraman Krishnamoorthi, Vikas Chandra

Abstract

Running AI models on smart edge devices can unlock versatile user experiences, but presents challenges due to limited compute and the need to handle multiple tasks simultaneously. This requires a vision encoder with small size but powerful and versatile representations. We present our method, Efficient Universal Perception Encoder (EUPE), which offers both inference efficiency and universally good representations for diverse downstream tasks. We achieve this by distilling from multiple domain-expert foundation vision encoders. Unlike previous agglomerative methods that directly scale down from multiple teachers to an efficient encoder, we demonstrate the importance of first scaling up to a large proxy teacher and then scaling down from this single teacher. Experiments show that EUPE achieves on-par or better performance than individual domain experts of the same size on diverse task domains and also outperforms previous agglomerative encoders. We release the full family of EUPE models and the code to foster future research.

Paper Structure (22 sections, 7 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Applying our distillation recipe (EUPE) to ViT-B gives a well-balanced universal encoder that excels across diverse task domains compared to both ViT-B domain experts and existing agglomerative ViT-Bs. Left: Performance on benchmarks across three task domains; higher is better. IN1k-ZS and IN1k-KNN are image understanding benchmarks on ImageNet1k. TextVQA, SQA, Realworld, GQA, and POPE are vision-language modeling tasks. SPair and ADE20k are dense prediction tasks. We omit the IN1k-ZS score for models without a text encoder (PEspatial-B, DINOv3-ViT-B, DUNE-B) and the IN1k-KNN score for models without a class token output (PEspatial-B). Right: Visualization of EUPE-ViT-B's features by PCA projection into RGB space.
  • Figure 2: Multi-stage distillation pipeline (scaling up $\rightarrow$ scaling down). In Stage 1 we distill from multiple foundation models into a heavy proxy model. In Stage 2 we distill from the proxy model into the target efficient encoder. In Stage 3 we finetune the Stage 2 models at multiple resolutions. The image pyramid indicates the multi-resolution inputs.
  • Figure 3: Per-teacher distillation flow. The snowflake symbol indicates frozen parameters and the flame symbol indicates trainable parameters. 2D interpolation is applied to the patch tokens when the student's output and the teacher's output have different spatial dimensions.
  • Figure 4: Comparison of dense features by projecting the patch tokens using PCA into RGB space. From left to right: EUPE-ViT-B16 (Ours), DINOv3-ViT-B16, DUNE-B14, RADIOv2.5-B16, SigLIP2-B16, PEcore-B16. Best viewed in color.
  • Figure 5: Comparison of the PCA components of dense features for encoders trained with different stage variants. The input is the same image as in Fig. 4, row 1. "Stage 2 only" means direct distillation from multiple teachers into the target efficient encoder. Best viewed in color.
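
Figures 1, 4, and 5 visualize dense features by projecting patch tokens into RGB space with PCA. The captions do not spell out the exact procedure, so the following is only a minimal sketch of one common way to do it, assuming the patch tokens are available as an (N, D) array laid out row-major on an h x w grid. The function name `pca_to_rgb` and the per-channel min-max normalization are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def pca_to_rgb(patch_tokens, grid_hw):
    """Project (N, D) patch tokens onto their top-3 principal components
    and rescale each component to [0, 1] so it can be shown as RGB.

    Hypothetical helper: the normalization choice is an assumption.
    """
    h, w = grid_hw
    x = np.asarray(patch_tokens, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] != h * w:
        raise ValueError("expected tokens of shape (h * w, D)")
    if min(x.shape) < 3:
        raise ValueError("need at least 3 tokens and 3 feature dimensions")

    # Center the features so the SVD yields principal directions.
    x = x - x.mean(axis=0, keepdims=True)

    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T  # shape (N, 3)

    # Min-max normalize each component independently (assumed convention).
    lo = proj.min(axis=0)
    hi = proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)

    # Restore the spatial layout of the patch grid.
    return rgb.reshape(h, w, 3)


if __name__ == "__main__":
    # Random stand-in for real encoder output: a 14 x 14 grid of 768-d tokens.
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(14 * 14, 768))
    image = pca_to_rgb(tokens, (14, 14))
    print(image.shape)
```

The resulting array can be upsampled to the input resolution and displayed with any image viewer. Because the sign of each principal component is arbitrary, colors are comparable within one image but not necessarily across encoders.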