
Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation

Alaa Dalaq, Muzammil Behzad

Abstract

Referring image segmentation aims to produce a pixel-level mask for the image region described by a natural-language expression. Although pretrained vision-language models have improved semantic grounding, many existing methods still rely on uniform refinement strategies that do not fully match the diverse reasoning requirements of referring expressions. Because of this mismatch, predictions often contain fragmented regions, inaccurate boundaries, or even the wrong object, especially when pretrained backbones are frozen for computational efficiency. To address these limitations, we propose SERA, a Spatio-Semantic Expert Routing Architecture for referring image segmentation. SERA introduces lightweight, expression-aware expert refinement at two complementary stages within a vision-language framework. First, we design SERA-Adapter, which inserts an expression-conditioned adapter into selected backbone blocks to improve spatial coherence and boundary precision through expert-guided refinement and cross-modal attention. We then introduce SERA-Fusion, which strengthens intermediate visual representations by reshaping token features into spatial grids and applying geometry-preserving expert transformations before multimodal interaction. In addition, a lightweight routing mechanism adaptively weights expert contributions while remaining compatible with pretrained representations. To make this routing stable under frozen encoders, SERA uses a parameter-efficient tuning strategy that updates only normalization and bias terms, affecting less than 1% of the backbone parameters. Experiments on standard referring image segmentation benchmarks show that SERA consistently outperforms strong baselines, with especially clear gains on expressions that require accurate spatial localization and precise boundary delineation.
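The tuning strategy described in the abstract, which updates only normalization and bias terms, can be illustrated with a small name-based filter. The following is a minimal sketch, not the authors' implementation: the parameter names, the hidden size of 768, and the single ViT-style block are all hypothetical, chosen only to show why such a selection touches a very small share of the weights.

```python
from typing import Dict


def is_trainable(name: str) -> bool:
    """Keep only normalization parameters and bias terms trainable.

    The naming convention ("norm" in the name, ".bias" suffix) is an
    assumption for this sketch, not something stated in the paper.
    """
    return "norm" in name or name.endswith(".bias")


def trainable_fraction(param_sizes: Dict[str, int]) -> float:
    """Return the share of parameters selected by is_trainable."""
    total = sum(param_sizes.values())
    selected = sum(n for name, n in param_sizes.items() if is_trainable(name))
    return selected / total


# Hypothetical parameter inventory for one ViT-style block (illustrative sizes).
d = 768
block = {
    "norm1.weight": d, "norm1.bias": d,
    "attn.qkv.weight": d * 3 * d, "attn.qkv.bias": 3 * d,
    "attn.proj.weight": d * d, "attn.proj.bias": d,
    "norm2.weight": d, "norm2.bias": d,
    "mlp.fc1.weight": d * 4 * d, "mlp.fc1.bias": 4 * d,
    "mlp.fc2.weight": 4 * d * d, "mlp.fc2.bias": d,
}

fraction = trainable_fraction(block)
print(f"trainable share of this block: {fraction:.4%}")
```

In this toy inventory the dense projection matrices dominate the parameter count, so the selected norm and bias terms stay well under 1% of the block, which is in line with the figure quoted in the abstract.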

Paper Structure

This paper contains 37 sections, 30 equations, 13 figures, and 3 tables.

Figures (13)

  • Figure 1: Overview of the proposed SERA. The left side of the figure shows visual-textual examples that illustrate typical challenges in referring image segmentation, along with grammatically challenging prompts that involve fine-grained attribute reasoning, relational grounding, and complex object boundaries. The middle part of the figure shows the SERA architecture, in which a router dynamically assigns routing weights to multiple experts within an expert-guided refinement module. The right side of the figure shows qualitative segmentation results produced by SERA, demonstrating improved alignment between referring expressions and predicted object regions. Ground-truth masks are shown at the bottom for reference.
  • Figure 2: Architecture of SERA for referring image segmentation. Visual features from a frozen DINOv2 backbone are refined using SERA-Adapter and fused with CLIP text embeddings via SERA-Fusion before decoding the final segmentation mask.
  • Figure 3: Detailed architecture of the proposed SERA-Adapter. Visual tokens from the backbone are projected into a spatial grid and processed by multi-scale convolutional branches to enrich local context. The resulting features are refined by two specialized experts, a spatial expert and a boundary expert, whose outputs are adaptively weighted by a lightweight router. The refined visual features are then aligned with text representations through cross-modal attention and incorporated into the backbone via a residual update.
  • Figure 4: Architecture of the proposed SERA-Fusion module. Intermediate visual feature maps are refined using a set of specialized experts targeting complementary cues, including spatial layout, boundary structure, contextual interaction, and global shape consistency. A lightweight routing network computes input-dependent expert weights from pooled spatial features and aggregates the selected expert outputs through weighted fusion. The refined representation preserves spatial resolution and feeds into the segmentation decoder.
  • Figure 5: Qualitative comparison on RefCOCO. Each row corresponds to a referring expression, with columns showing the original image, ground-truth mask, DETRIS prediction, and SERA (ours) prediction.
  • ...and 8 more figures
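The routing step described in the captions for Figures 3 and 4 (pool the spatial features, compute input-dependent expert weights, fuse the expert outputs by weighting) can be sketched in a few lines. This is a rough illustration under stated assumptions, not the paper's code: the feature map is a single-channel grid, the gate is a per-expert scalar weight and bias, the weighting is a dense softmax over all experts, and the two toy experts (`identity_expert`, `edge_expert`) are hypothetical stand-ins for the spatial and boundary experts.

```python
import math
from typing import Callable, List, Sequence, Tuple

Grid = List[List[float]]  # single-channel H x W feature map, for brevity


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def global_average_pool(grid: Grid) -> float:
    """Collapse the spatial grid to one scalar descriptor."""
    return sum(sum(row) for row in grid) / (len(grid) * len(grid[0]))


def identity_expert(grid: Grid) -> Grid:
    """Toy stand-in for a spatial expert: passes features through."""
    return [row[:] for row in grid]


def edge_expert(grid: Grid) -> Grid:
    """Toy stand-in for a boundary expert: absolute horizontal difference."""
    return [
        [abs(row[j + 1] - row[j]) if j + 1 < len(row) else 0.0
         for j in range(len(row))]
        for row in grid
    ]


def route(
    grid: Grid,
    experts: Sequence[Callable[[Grid], Grid]],
    gate_weights: Sequence[float],
    gate_biases: Sequence[float],
) -> Tuple[Grid, List[float]]:
    """Weight expert outputs by a softmax gate computed from pooled features."""
    pooled = global_average_pool(grid)
    logits = [w * pooled + b for w, b in zip(gate_weights, gate_biases)]
    weights = softmax(logits)
    outputs = [expert(grid) for expert in experts]
    h, w = len(grid), len(grid[0])
    # Weighted sum keeps the H x W layout, so spatial resolution is preserved.
    fused = [
        [sum(wt * out[i][j] for wt, out in zip(weights, outputs))
         for j in range(w)]
        for i in range(h)
    ]
    return fused, weights


features = [[0.0, 1.0], [1.0, 1.0]]
fused, weights = route(features, [identity_expert, edge_expert],
                       gate_weights=[0.0, 0.0], gate_biases=[0.0, 0.0])
print(weights, fused)
```

With a zero-initialized gate, both experts receive equal weight, so the fused map is the plain average of the expert outputs; a trained gate would shift that balance per input. A sparse variant that keeps only the top-scoring experts would replace the dense softmax, which the Figure 4 caption's mention of "selected expert outputs" may suggest.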