Distilled Large Language Model-Driven Dynamic Sparse Expert Activation Mechanism

Qinghui Chen, Zekai Zhang, Zaigui Zhang, Kai Zhang, Dagang Li, Wenmin Wang, Jinglin Zhang, Cong Liu

Abstract

High inter-class similarity, extreme scale variation, and limited computational budgets hinder reliable visual recognition across diverse real-world data. Existing vision-centric and cross-modal approaches often rely on rigid fusion mechanisms and heavy annotation pipelines, leading to sub-optimal generalization. We propose the Distilled Large Language Model (LLM)-Driven Sparse Mixture-of-Experts (DS-MoE) framework, which integrates text-guided dynamic routing and lightweight multi-scale comprehension. The DS-MoE framework dynamically aligns textual semantics with defect-specific visual patterns through a sparse MoE architecture, where task-relevant experts are adaptively activated based on semantic relevance, resolving inter-class ambiguity. A lightweight MobileSAM encoder enables real-time inference while preserving multi-scale defect details. Extensive experiments on PCB, aluminum foil, and mold defect datasets demonstrate that our framework outperforms existing pure vision models. DS-MoE surpasses YOLOv8/YOLOX with gains of +13.9, +1.4, and +2.0 pp in mAP@0.5:0.95 on the BBMP, aluminum, and PCB datasets, respectively, while also improving precision and recall.

Paper Structure

This paper contains 23 sections, 35 equations, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: Examples from the industrial detection datasets. The first and second rows show bottle-bottom mold-point images and PCB defect images, respectively; the third row shows aluminum defect images. These examples illustrate why industrial quality inspection requires an object detector that copes with high inter-class similarity and wide variation in defect size.
  • Figure 2: DS-MoE framework. DeepSeek-R1 generates defect-specific text prompts, the MobileSAM encoder extracts multi-scale visual features, and hyperbolic manifold alignment fuses text and vision in a curvature-aware space. Dynamic sparse MoE gates (top-k experts) activate task-relevant visual experts for fine-grained defect analysis, and a dual-branch head outputs classification and localization simultaneously.
  • Figure 3: Hyperbolic manifold alignment. In the Poincaré ball, visual features are first lifted onto the manifold via the exponential map, while distilled text embeddings are anchored in the same space. Their logarithmic mappings are then fused with a learnable weight, preserving both the global defect taxonomy and local semantic nuances. The resulting geometrically aligned features enable downstream curvature-aware convolutional sampling (see the alignment sketch after this list).
  • Figure 4: A concise flowchart of Stages 8–10. The sparse MoE dynamically routes each input to only $\lfloor\log_2 N_e\rfloor$ experts: two task-specific modules (anisotropic local patterns and global structure) and selected cross-modal experts that fuse vision with replicated text embeddings. Their outputs are ensembled via dilated convolutions and reweighted by channel-wise attention, yielding a compact feature map (see the routing sketch after this list).
  • Figure 5: The glass-bottle-bottom detection device collects images. The bottle is suspended by conveyor belts on both sides and carried to the imaging station, where a photoelectric gate triggers the light source and the bottom-mounted camera to capture pictures.
  • ...and 5 more figures
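
To make the alignment of Figure 3 concrete, here is a minimal PyTorch sketch using the standard Poincaré-ball exponential and logarithmic maps at the origin. The module name `HyperbolicAlignment`, the curvature `c`, and the learnable fusion weight `alpha` are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def expmap0(v, c=1.0, eps=1e-6):
    """Lift tangent vectors at the origin onto the Poincare ball of curvature c."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def logmap0(y, c=1.0, eps=1e-6):
    """Map ball points back to the tangent space at the origin."""
    sqrt_c = c ** 0.5
    norm = y.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.atanh((sqrt_c * norm).clamp(max=1 - eps)) * y / (sqrt_c * norm)

class HyperbolicAlignment(torch.nn.Module):
    """Sketch of Figure 3: lift both modalities onto the ball, fuse their log maps."""
    def __init__(self, c=1.0):
        super().__init__()
        self.c = c
        # Hypothetical learnable fusion weight between vision and text.
        self.alpha = torch.nn.Parameter(torch.tensor(0.5))

    def forward(self, vis_feat, txt_feat):
        vis_ball = expmap0(vis_feat, self.c)   # visual features lifted onto the manifold
        txt_ball = expmap0(txt_feat, self.c)   # distilled text embeddings anchored there
        return self.alpha * logmap0(vis_ball, self.c) + \
               (1 - self.alpha) * logmap0(txt_ball, self.c)
```

Calling `HyperbolicAlignment()(torch.randn(4, 256), torch.randn(4, 256))` returns geometrically aligned tangent-space features of the same shape, ready for the downstream curvature-aware convolutional sampling the caption mentions.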
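
The dynamic sparse gating of Figure 4 activates only $\lfloor\log_2 N_e\rfloor$ of the $N_e$ experts per input. The sketch below assumes a linear softmax gate and simple MLP experts; the class name `SparseMoE` and the expert architecture are hypothetical stand-ins for the paper's task-specific and cross-modal experts.

```python
import math
import torch

class SparseMoE(torch.nn.Module):
    """Sketch of Figure 4's top-k routing with k = floor(log2 N_e)."""
    def __init__(self, dim, num_experts=8):
        super().__init__()
        self.k = max(1, int(math.log2(num_experts)))   # floor(log2 N_e) active experts
        self.gate = torch.nn.Linear(dim, num_experts)  # assumed linear gating network
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.GELU(),
                                torch.nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                    # x: (batch, dim) fused features
        scores = self.gate(x)                # (batch, num_experts) routing logits
        topv, topi = scores.topk(self.k, dim=-1)
        weights = topv.softmax(dim=-1)       # renormalize over the active experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):           # route each input to its selected experts
            for e in range(len(self.experts)):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

The double loop is written for readability; practical MoE implementations instead group inputs by expert so each expert runs one batched forward pass. The dilated-convolution ensembling and channel-wise attention from the caption would follow this module.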