Foundry: Distilling 3D Foundation Models for the Edge

Guillaume Letellier, Siddharth Srivastava, Frédéric Jurie, Gaurav Sharma

TL;DR

This work tackles the bottleneck of deploying large self-supervised 3D foundation models on edge hardware by introducing Foundation Model Distillation (FMD) and Foundry, a framework that distills a teacher’s entire latent space into a compact set of learnable SuperTokens. Through a compress-and-reconstruct objective, a lightweight student learns a transferable, general-purpose 3D backbone that maintains strong performance across classification, segmentation, and few-shot tasks while substantially reducing tokens and FLOPs. The approach combines Dynamic Supertoken Optimization (DSO) and Cross-Attention Upsampling (CAU) to compress latent representations and faithfully reconstruct them, with an optional budget-aware gate to enable on-the-fly compute control. Empirical results show that Foundry closely matches teacher performance on many benchmarks, offers robust transfer capabilities, and enables edge-scale deployment with meaningful reductions in computation, memory, and latency, marking a path toward practical 3D foundation-model proxies for resource-constrained devices.

Abstract

Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, and few-shot scenarios), approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.

Paper Structure

This paper contains 71 sections, 14 equations, 15 figures, 13 tables.

Figures (15)

  • Figure 1: Overview of the Foundry Framework. Our distillation strategy follows a compress-and-reconstruct pipeline. The student model uses a Dynamic Supertoken Optimization (DSO) module to compress the input tokens into a small set of learnable SuperTokens. After processing by a lightweight encoder, a Cross-Attention Upsampling (CAU) module reconstructs an approximation of the teacher's latent space. The entire student is trained to minimize the reconstruction error, forcing the SuperTokens to become a powerful, compact basis for the teacher's representations.
  • Figure 2: Relationship between distillation loss and fine-tuning accuracy. Each point corresponds to a distilled model with a different number of SuperTokens ($s$) across multiple datasets. A clear inverse correlation is observed, with diminishing improvements beyond $s{\leq}4$, indicating that reconstruction fidelity saturates once the student sufficiently spans the teacher’s latent space.
  • Figure 3: Fine-tuning results vs. token budget for Foundry-Gate. We use the best backbone on each dataset with $\lambda_{\text{gate}}=10^{-11}$ and no weight decay on the gate module. The token budget is the maximum number of tokens we accept as input to the core encoder ($r$ is dynamic and changes for each input example). We report top-1 overall accuracy for classification tasks and mIoU per class and per instance for part segmentation tasks. All metrics are computed on the validation set for each dataset.
  • Figure 4: Comparison of distillation losses. We train Foundry with the distillation loss changed from Smooth-L1 to MSE. To show the difference, we plot validation top-1 accuracy curves during fine-tuning on ShapeNet55 after distillation with the specified loss.
  • Figure 5: Fine-tuning results vs. $\lambda_{\text{gate}}$. No weight decay is used on the gate module during distillation. All metrics are computed on the validation set for each dataset.
  • ...and 10 more figures
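
The compress-and-reconstruct pipeline described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the paper's DSO or CAU implementation: it swaps in plain single-head cross-attention without learned projections for both steps, omits the lightweight encoder, uses random vectors in place of real teacher latents, and uses MSE as the reconstruction error (the captions mention both Smooth-L1 and MSE). All function names and the sizes `n`, `s`, `d` are hypothetical, chosen only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a nested-list matrix."""
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(queries, keys_values):
    """Scaled dot-product attention; keys and values share one matrix.

    Assumption: no learned projections, which a real module would have.
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    scores = [[v / math.sqrt(d) for v in row] for row in scores]
    return matmul(softmax_rows(scores), keys_values)


def compress_and_reconstruct(tokens, supertokens):
    """Compress n tokens into s SuperTokens, then expand back to n tokens."""
    # Compress: each SuperToken queries all input tokens -> (s x d).
    compressed = cross_attention(supertokens, tokens)
    # A lightweight encoder would process `compressed` here (omitted).
    # Reconstruct: each input token queries the compressed set -> (n x d).
    return cross_attention(tokens, compressed)


def mse(a, b):
    """Mean squared error between two equally shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
n, s, d = 16, 4, 8  # hypothetical: 16 tokens, 4 SuperTokens, width 8
tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
supertokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(s)]
# Stand-in for the teacher's token-level representations (random here).
teacher_latents = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

recon = compress_and_reconstruct(tokens, supertokens)
loss = mse(recon, teacher_latents)
print(f"reconstruction: {len(recon)} x {len(recon[0])}, loss = {loss:.4f}")
```

In a training setup along these lines, `supertokens` and the attention weights would be learnable parameters updated to reduce `loss`, so the `s` compressed vectors are pushed to carry enough information to rebuild all `n` teacher tokens.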