EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation

Longfei Liu; Yongjie Hou; Yang Li; Qirui Wang; Youyang Sha; Yongjun Yu; Yinzhi Wang; Peizhe Ru; Xuanlong Yu; Xi Shen

Abstract

Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve a similarly strong accuracy-efficiency trade-off, even with large-scale pretraining. We argue that this gap is largely due to insufficient task-specific representation learning in small-scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder-decoder design. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter's reliance on extensive Objects365 pretraining. These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. Code is available at: https://intellindust-ai-lab.github.io/projects/EdgeCrafter/

Paper Structure (49 sections, 9 equations, 3 figures, 13 tables)

Introduction
Related Work
Knowledge distillation from vision foundation models.
Efficient object detection.
Efficient instance segmentation.
Efficient human pose estimation.
Method
Overview
ECDet Architecture
Backbone design.
Multi-scale feature generation.
Encoder.
Decoder.
Training objective.
Model scaling and architecture details.
...and 34 more sections

Figures (3)

Figure 1: Comprehensive evaluation of EdgeCrafter. (a) Comparison with state-of-the-art methods across multiple vision tasks on COCO lin2014microsoft. The plots show model parameters (top row) and FLOPs (bottom row) versus mAP. Methods marked with $^{*}$ are pre-trained on the Objects365 dataset shao2019objects365. From left to right, the columns correspond to object detection, human pose estimation, and instance segmentation. (b) Analysis of different pretraining strategies for the backbone, based on the ECDet-T model. ViT-T (Tiny) follows the training strategy proposed in steiner2022train on ImageNet-21K ridnik2021imagenet. In our experiments, supervised ImageNet-21K pretraining is weaker than no pretraining for this compact model, consistent with observations reported by ghiasi2020simple, zoph20selftraining. Task-specialized distillation yields substantially stronger downstream performance.
Figure 2: Overview of the EdgeCrafter pipeline. Stage 1: A pretrained DINOv3 backbone simeoni2025dinov3 is adapted to object detection to create a task-specialized teacher within the ECDet formulation. Stage 2: The resulting teacher distills its detection-oriented representation into compact ECViT student backbones through feature alignment on a large image collection. Stage 3: The distilled students are used to instantiate the ECDet model family at different scales (S/M/L/X), and the same distilled backbone and encoder are further reused for instance segmentation and human pose estimation with lightweight task-specific heads. The key idea is that detection serves as the representation-learning stage, while the learned backbone transfers directly to other dense prediction tasks.
Figure 3: Architecture of ECDet. ECDet consists of three components: a distilled ECViT backbone, an encoder, and a decoder. The backbone replaces the standard large-stride patch embedding with a four-layer convolutional stem and outputs a single-resolution token representation. A lightweight multi-scale feature generator then aggregates the final transformer blocks and produces feature maps at strides $8$, $16$, and $32$ with interpolation and $1 \times 1$ projections. The encoder refines and fuses these features, and the decoder performs set-based object prediction from learned object queries. The overall design keeps the detector compact while preserving the multi-scale structure required for dense localization.
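
The multi-scale feature generator described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the backbone tokens sit at stride 16, that the final transformer blocks are fused by simple averaging, and that nearest-neighbour resizing stands in for the interpolation step; none of these choices are stated in the caption. All function names are hypothetical, and plain Python lists are used in place of a tensor library so that the example is self-contained.

```python
"""Sketch of a multi-scale feature generator (assumptions noted inline)."""
from typing import Dict, List, Sequence

Feature = List[List[List[float]]]  # feature map indexed as [H][W][C]


def aggregate(blocks: Sequence[Feature]) -> Feature:
    """Average the outputs of the final transformer blocks (assumed fusion rule)."""
    n = len(blocks)
    h, w, c = len(blocks[0]), len(blocks[0][0]), len(blocks[0][0][0])
    return [[[sum(b[i][j][k] for b in blocks) / n for k in range(c)]
             for j in range(w)] for i in range(h)]


def resize_nearest(feat: Feature, out_h: int, out_w: int) -> Feature:
    """Nearest-neighbour resize to (out_h, out_w); a stand-in for interpolation."""
    h, w = len(feat), len(feat[0])
    return [[list(feat[i * h // out_h][j * w // out_w]) for j in range(out_w)]
            for i in range(out_h)]


def project_1x1(feat: Feature, weight: List[List[float]]) -> Feature:
    """1x1 projection: apply the same C_out x C_in matrix at every location."""
    return [[[sum(wk * x for wk, x in zip(row, pix)) for row in weight]
             for pix in line] for line in feat]


def multi_scale(blocks: Sequence[Feature],
                weights: Dict[int, List[List[float]]],
                base_stride: int = 16,  # assumed stride of the backbone tokens
                strides: Sequence[int] = (8, 16, 32)) -> Dict[int, Feature]:
    """Build one feature map per target stride from single-resolution tokens."""
    fused = aggregate(blocks)
    h, w = len(fused), len(fused[0])
    pyramid = {}
    for s in strides:
        out_h, out_w = h * base_stride // s, w * base_stride // s
        pyramid[s] = project_1x1(resize_nearest(fused, out_h, out_w), weights[s])
    return pyramid


# Toy usage: two final blocks of 4x4 tokens with 2 channels
# (e.g. a 64x64 input under the assumed stride of 16).
block_a = [[[1.0, 2.0] for _ in range(4)] for _ in range(4)]
block_b = [[[3.0, 4.0] for _ in range(4)] for _ in range(4)]
identity = [[1.0, 0.0], [0.0, 1.0]]
pyramid = multi_scale([block_a, block_b], {8: identity, 16: identity, 32: identity})
for stride, feat in pyramid.items():
    print(stride, len(feat), len(feat[0]), len(feat[0][0]))
```

Under these assumptions the stride-8 map is upsampled from the token grid, the stride-16 map keeps the token resolution, and the stride-32 map is downsampled, so the encoder receives a three-level pyramid without the backbone itself being hierarchical.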
