HI-MoE: Hierarchical Instance-Conditioned Mixture-of-Experts for Object Detection

Vadim Vashkelis, Natalia Trukhina

Abstract

Mixture-of-Experts (MoE) architectures enable conditional computation by activating only a subset of model parameters for each input. Although sparse routing has been highly effective in language models and has also shown promise in vision, most vision MoE methods operate at the image or patch level. This granularity is poorly aligned with object detection, where the fundamental unit of reasoning is an object query corresponding to a candidate instance. We propose Hierarchical Instance-Conditioned Mixture-of-Experts (HI-MoE), a DETR-style detection architecture that performs routing in two stages: a lightweight scene router first selects a scene-consistent expert subset, and an instance router then assigns each object query to a small number of experts within that subset. This design aims to preserve sparse computation while better matching the heterogeneous, instance-centric structure of detection. In the current draft, experiments are concentrated on COCO with preliminary specialization analysis on LVIS. Under these settings, HI-MoE improves over a dense DINO baseline and over simpler token-level or instance-only routing variants, with especially strong gains on small objects. We also provide an initial visualization of expert specialization patterns. We present the method, ablations, and current limitations in a form intended to support further experimental validation.
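
To make the two-stage routing concrete, the following is a minimal, illustrative PyTorch sketch of the idea described in the abstract, not the authors' implementation. The class and argument names (`HierarchicalMoEFFN`, `subset_size`, `top_k`) and all sizes are assumptions, and the dense loop over experts trades efficiency for readability.

```python
# Minimal sketch (assumed names and sizes): a scene router keeps a scene-level
# subset of experts per image, then an instance router does per-query top-K
# routing restricted to that subset. Experts stand in for transformer FFNs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMoEFFN(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, num_experts=8, subset_size=4, top_k=2):
        super().__init__()
        self.subset_size = subset_size
        self.top_k = top_k
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.scene_router = nn.Linear(d_model, num_experts)     # scores from a pooled scene feature
        self.instance_router = nn.Linear(d_model, num_experts)  # scores per object query

    def forward(self, queries, scene_feat):
        # queries: (B, Q, d_model) object queries; scene_feat: (B, d_model) pooled image feature.
        B, Q, D = queries.shape

        # Stage 1: scene router selects a scene-consistent expert subset per image.
        scene_scores = self.scene_router(scene_feat)                       # (B, E)
        subset_idx = scene_scores.topk(self.subset_size, dim=-1).indices   # (B, S)

        # Stage 2: instance router scores all experts, masks out experts outside
        # the scene subset, then takes the per-query top-K inside the subset.
        inst_scores = self.instance_router(queries)                        # (B, Q, E)
        mask = torch.full_like(inst_scores, float("-inf"))
        mask.scatter_(-1, subset_idx.unsqueeze(1).expand(-1, Q, -1), 0.0)
        top_vals, top_idx = (inst_scores + mask).topk(self.top_k, dim=-1)  # (B, Q, K)
        gates = F.softmax(top_vals, dim=-1)

        # Dense loop over experts for clarity; a practical implementation would
        # dispatch sparsely so only the selected experts run for each query.
        out = torch.zeros_like(queries)
        for e, expert in enumerate(self.experts):
            sel = (top_idx == e)                                           # (B, Q, K)
            if sel.any():
                weight = (gates * sel).sum(-1, keepdim=True)               # (B, Q, 1)
                out = out + weight * expert(queries)
        return out
```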

Figures (2)

  • Figure 1: HI-MoE overview. A scene router first selects a scene-consistent expert subset; an instance router then performs per-query top-$K$ routing inside that subset. Sparse experts replace selected transformer FFNs.
  • Figure 2: Visualization derived from the expert-level routing statistics in Table \ref{tab:specialization}. Left: per-expert subset AP for representative experts and the average row. Right: for each displayed expert, the proportion of that expert's routed assignments associated with its dominant scene-route category (Crowd for E1, Indoor for E3, Outdoor for E6). These percentages are normalized independently per expert, so they are not additive and need not sum to 100% across the three bars. This figure is intended as a first illustration of specialization rather than a complete utilization analysis across all experts and layers.
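
To make the per-expert normalization described in the Figure 2 caption concrete, the toy snippet below uses invented routing counts (not the paper's data) to show why each bar is on an independent scale and the three bars need not sum to 100%.

```python
# Toy illustration with made-up counts of routed assignments per expert.
# Each bar in Figure 2 (right) is the share of one expert's own assignments
# that fall in its dominant scene-route category, normalized within that expert.
routed_counts = {
    "E1": {"Crowd": 620, "Indoor": 210, "Outdoor": 170},
    "E3": {"Crowd": 150, "Indoor": 540, "Outdoor": 310},
    "E6": {"Crowd": 120, "Indoor": 260, "Outdoor": 620},
}
dominant = {"E1": "Crowd", "E3": "Indoor", "E6": "Outdoor"}

for expert, counts in routed_counts.items():
    share = counts[dominant[expert]] / sum(counts.values())
    print(f"{expert}: {share:.1%} of its assignments go to {dominant[expert]} scenes")
# The three printed percentages are per-expert fractions, so summing them
# across experts is not meaningful.
```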
  • Figure 2: Visualization derived from the expert-level routing statistics in Table \ref{['tab:specialization']}. Left: per-expert subset AP for representative experts and the average row. Right: for each displayed expert, the proportion of that expert's routed assignments associated with its dominant scene-route category (Crowd for E1, Indoor for E3, Outdoor for E6). These percentages are normalized independently per expert and therefore are not additive and are not expected to sum to 100% across the three bars. This figure is intended as a first-step illustration of specialization rather than a complete utilization analysis across all experts and layers.