YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection

Ori Meiraz, Sharon Shalev, Avishai Weizman

TL;DR

This work addresses robustness in object detection by integrating a Mixture-of-Experts (MoE) framework into YOLOv9-T, enabling adaptive routing among specialized detectors. The proposed architecture uses I=3 multi-scale feature maps and E=2 experts. At each scale, a router performs a Hadamard-based fusion to produce normalized routing weights α_i for each expert, and a load-balancing loss L_{lb} prevents expert collapse, giving the total objective L = L_{det} + λ_{lb} L_{lb}. The routing allows dynamic feature-level specialization and improves mean Average Precision (mAP) and Average Recall (AR) on the COCO and VisDrone datasets, including multi-dataset and combined training scenarios (e.g., COCO+Vis, with mAP up to 37.5 and AR up to 50.0). The work demonstrates the practicality of MoE for object detection and suggests future extensions to larger YOLO variants, more efficient routing, and temporal or multi-modal video applications.
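
The routing step and the combined objective can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: the "Hadamard-based fusion" is modeled as an element-wise product of pooled per-expert feature vectors followed by a linear projection, the projection matrix `proj` is a hypothetical router parameter, and the load-balancing term uses a common generic form (E times the sum of squared mean routing weights), since the summary does not state the paper's exact definition. The default `lambda_lb` value is likewise a placeholder.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(expert_feats, proj):
    """Return normalized routing weights alpha, one per expert.

    expert_feats: E pooled feature vectors of length C, one per expert.
    proj: E x C matrix of hypothetical router parameters.
    Assumption: fusion = element-wise (Hadamard) product of the experts'
    features, then a linear layer and a softmax.
    """
    fused = [math.prod(vals) for vals in zip(*expert_feats)]
    logits = [sum(w * f for w, f in zip(row, fused)) for row in proj]
    return softmax(logits)


def load_balance_loss(alpha_batch):
    """Assumed form: E * sum_e (mean routing weight of expert e)^2.

    Evaluates to 1.0 for perfectly uniform usage and to E when a single
    expert receives all of the routing weight.
    """
    num_experts = len(alpha_batch[0])
    means = [
        sum(alpha[e] for alpha in alpha_batch) / len(alpha_batch)
        for e in range(num_experts)
    ]
    return num_experts * sum(m * m for m in means)


def total_loss(det_loss, lb_loss, lambda_lb=0.01):
    """L = L_det + lambda_lb * L_lb (lambda_lb default is a placeholder)."""
    return det_loss + lambda_lb * lb_loss


# Toy example with E=2 experts and C=2 channels.
alpha = route([[1.0, 2.0], [0.5, 1.0]], proj=[[1.0, 0.0], [0.0, 1.0]])
loss = total_loss(det_loss=1.0, lb_loss=load_balance_loss([alpha]))
```

With this assumed penalty, a router that always picks the same expert pays the maximum cost, which is the collapse behavior the load-balancing term is meant to counter.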

Abstract

This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.

Paper Structure

This paper contains 4 sections, 5 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Visualization of MoE within the YOLOv9 architecture. Multiple experts process the input image to produce multi-scale feature maps and outputs (class and bounding-box logits). Routers at different resolutions (8×8, 16×16, 32×32) generate adaptive routing weights that fuse expert outputs into final detections. The loss is computed between the model outputs before non-maximum suppression (NMS) [Hosang2017LearningNS, wang2024yolov9] and the ground truth, ensuring end-to-end differentiability.
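
The per-scale fusion described in the Figure 1 caption can be sketched as a weighted sum of expert outputs, with an independent set of routing weights at each resolution. This is a hedged illustration only: the function names, the flat-list representation of a logit map, and the dictionary keyed by scale label are all assumptions, and the weighted sum is one plausible reading of "fuse" rather than the confirmed operation from the paper.

```python
def fuse_scale(expert_maps, alpha):
    """Blend E expert maps (flat lists of logits) using weights alpha."""
    if abs(sum(alpha) - 1.0) > 1e-9:
        raise ValueError("routing weights must sum to 1")
    size = len(expert_maps[0])
    return [
        sum(a * m[k] for a, m in zip(alpha, expert_maps))
        for k in range(size)
    ]


def fuse_all_scales(maps_by_scale, alpha_by_scale):
    """Apply each scale's own routing weights to that scale's expert maps."""
    return {
        scale: fuse_scale(maps, alpha_by_scale[scale])
        for scale, maps in maps_by_scale.items()
    }


# Toy example: E=2 experts, I=3 scales, two logits per map.
maps_by_scale = {
    "8x8": [[1.0, 2.0], [3.0, 4.0]],
    "16x16": [[0.0, 1.0], [1.0, 0.0]],
    "32x32": [[2.0, 2.0], [4.0, 4.0]],
}
alpha_by_scale = {
    "8x8": [0.25, 0.75],
    "16x16": [0.5, 0.5],
    "32x32": [1.0, 0.0],
}
fused = fuse_all_scales(maps_by_scale, alpha_by_scale)
```

Because the fused logits are a differentiable function of both the expert outputs and the routing weights, a loss computed on them before NMS can propagate gradients to the experts and the routers alike.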