Foundry: Distilling 3D Foundation Models for the Edge
Guillaume Letellier, Siddharth Srivastava, Frédéric Jurie, Gaurav Sharma
TL;DR
This work addresses the difficulty of deploying large self-supervised 3D foundation models on edge hardware. It introduces Foundation Model Distillation (FMD) and Foundry, a framework that distills a teacher’s entire latent space into a compact set of learnable SuperTokens. Through a compress-and-reconstruct objective, a lightweight student learns a transferable, general-purpose 3D backbone that maintains strong performance across classification, segmentation, and few-shot tasks while substantially reducing tokens and FLOPs. The approach combines Dynamic Supertoken Optimization (DSO) and Cross-Attention Upsampling (CAU) to compress latent representations and faithfully reconstruct them, with an optional budget-aware gate for on-the-fly compute control. Empirical results show that Foundry closely matches teacher performance on many benchmarks, transfers robustly across tasks, and reduces computation, memory, and latency enough for edge-scale deployment, pointing toward practical 3D foundation-model proxies for resource-constrained devices.
Abstract
Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Foundry trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, and few-shot scenarios), approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.
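As an illustration only, the compress-and-reconstruct idea can be sketched in a few lines of dependency-free Python. This is not the authors' implementation: plain single-head cross-attention without learned projections stands in for both the compression and the upsampling step, and every name here (`cross_attend`, `compress_and_reconstruct`, `super_queries`, `pos_queries`) is hypothetical. The sketch assumes N student tokens are pooled into K < N SuperTokens, which are then expanded back to N tokens and compared with the teacher's tokens by mean squared error.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        w = softmax([scale * dot(q, k) for k in keys])
        out.append([sum(wi * v[d] for wi, v in zip(w, values)) for d in range(dim)])
    return out


def compress_and_reconstruct(student_tokens, teacher_tokens, super_queries, pos_queries):
    # Compress: K query vectors (assumed learnable) pool N student tokens into K SuperTokens.
    supertokens = cross_attend(super_queries, student_tokens, student_tokens)
    # Reconstruct: one query per original token position reads from the K SuperTokens.
    rebuilt = cross_attend(pos_queries, supertokens, supertokens)
    # Objective: mean squared error against the teacher's token-level features.
    n, d = len(teacher_tokens), len(teacher_tokens[0])
    mse = sum(
        (r - t) ** 2
        for r_row, t_row in zip(rebuilt, teacher_tokens)
        for r, t in zip(r_row, t_row)
    ) / (n * d)
    return supertokens, rebuilt, mse


if __name__ == "__main__":
    random.seed(0)
    N, K, D = 8, 3, 4  # toy sizes: 8 tokens compressed to 3 SuperTokens of width 4

    def rand(rows):
        return [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(rows)]

    st, rb, loss = compress_and_reconstruct(rand(N), rand(N), rand(K), rand(N))
    print(len(st), len(rb), loss)
```

In a real training loop the query vectors and the student encoder would be optimized by gradient descent to minimize this loss; the sketch only evaluates the forward pass on random data to show how token counts change from N to K and back to N.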
