Moondream Segmentation: From Words to Masks

Ethan Reid

Abstract

We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves 80.2% cIoU on RefCOCO (val) and 62.6% mIoU on LVIS (val).
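
For concreteness, the decode-rasterize-refine loop described above can be sketched as follows. The `vlm` and `refiner` objects and their methods (`encode_image`, `decode_path`) are hypothetical placeholders, not Moondream's actual API; only the polygon rasterization is spelled out.

```python
import numpy as np

def rasterize_polygon(vertices: np.ndarray, height: int, width: int) -> np.ndarray:
    """Even-odd (ray-casting) fill of a closed polygon into a binary mask.
    `vertices` is an (N, 2) array of (x, y) points in pixel coordinates."""
    xs, ys = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    inside = np.zeros((height, width), dtype=bool)
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # An edge flips parity for pixels whose leftward horizontal ray it crosses.
        crossing = ((y0 <= ys) != (y1 <= ys)) & (
            xs < x0 + (ys - y0) * (x1 - x0) / (y1 - y0 + 1e-12)
        )
        inside ^= crossing
    return inside

def segment(image: np.ndarray, prompt: str, vlm, refiner, steps: int = 4) -> np.ndarray:
    """Two-stage flow: decode a vector path from the image and referring
    expression, rasterize it to a coarse mask, then iteratively refine
    conditioned on frozen vision features. `vlm` and `refiner` are stand-ins."""
    h, w = image.shape[:2]
    features = vlm.encode_image(image)         # frozen vision features
    path = vlm.decode_path(features, prompt)   # (N, 2) autoregressively decoded vertices
    mask = rasterize_polygon(path, h, w).astype(np.float32)  # coarse mask
    for _ in range(steps):
        mask = refiner(mask, features)         # each pass sharpens the mask
    return mask > 0.5
```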

Figures (11)

  • Figure 1: Example masks produced by Moondream Segmentation. Prompts are shown in white boxes.
  • Figure 2: High-level overview of Moondream Segmentation. The VLM decodes a vector path from the image and prompt, which is rasterized into a coarse mask. An iterative refiner conditioned on frozen vision features produces the final mask.
  • Figure 3: Training data generation pipeline. Web images are labeled by an ensemble of VLMs with text annotations and bounding boxes, verified by Moondream, filtered for consistency and accuracy, and passed to a segmentation model to propose masks. Surviving image-text-box-mask tuples are added to the final dataset; rejected samples are discarded.
  • Figure 4: Original RefCOCO polygon masks (top) and RefCOCO-M refined masks (bottom). RefCOCO-M tightens boundaries and recovers fine structure that is often missing from the original annotations.
  • Figure 5: Boundary-focused qualitative comparison (prompt: car). Moondream masks are typically sharper at edges and better preserve fine structure than SAM 3.
  • ...and 6 more figures
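
The two metrics reported in the abstract aggregate differently: cIoU (cumulative IoU) pools intersection and union pixel counts over the whole split before dividing, so large objects carry more weight, while mIoU averages per-sample IoU so every example counts equally. A minimal sketch of both, assuming lists of binary NumPy masks:

```python
import numpy as np

def ciou(preds, gts):
    """Cumulative IoU: pool pixel counts across the split, then divide."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union

def miou(preds, gts):
    """Mean IoU: average per-sample IoU (empty unions guarded to avoid 0/0)."""
    ious = [np.logical_and(p, g).sum() / max(np.logical_or(p, g).sum(), 1)
            for p, g in zip(preds, gts)]
    return float(np.mean(ious))
```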