
Explore In-Context Segmentation via Latent Diffusion Models

Chaoyang Wang, Xiangtai Li, Henghui Ding, Lu Qi, Jiangning Zhang, Yunhai Tong, Chen Change Loy, Shuicheng Yan

TL;DR

This work demonstrates that latent diffusion models can perform in-context segmentation guided solely by visual prompts, without textual cues or extra refinement networks. By focusing on instruction extraction, output alignment via pseudo-masking, and two meta-architectures (LDIS-1 and LDIS-n), the approach unlocks segmentation capabilities directly within the diffusion framework. The authors introduce a two-stage masking strategy, augmented pseudo-masking, and a new image/video benchmark, showing results that are competitive with specialist and foundation models across image and video tasks. The study highlights the potential of unifying segmentation and generation under diffusion models and provides a practical pathway for broader, data-efficient in-context visual reasoning.

Abstract

In-context segmentation has drawn increasing attention with the advent of vision foundation models. Its goal is to segment objects using given reference images. Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries. This work approaches the problem from a fresh perspective - unlocking the capability of the latent diffusion model (LDM) for in-context segmentation and investigating different design choices. Specifically, we examine the problem from three angles: instruction extraction, output alignment, and meta-architectures. We design a two-stage masking strategy to prevent interfering information from leaking into the instructions. In addition, we propose an augmented pseudo-masking target to ensure the model predicts without forgetting the original images. Moreover, we build a new and fair in-context segmentation benchmark that covers both image and video datasets. Experiments validate the effectiveness of our approach, demonstrating comparable or even stronger results than previous specialist or visual foundation models. We hope our work inspires others to rethink the unification of segmentation and generation.

Paper Structure (11 sections, 10 equations, 6 figures, 8 tables)


Figures (6)

  • Figure 1: Method comparison. (a) Discriminative models match query images with support prototypes. (b) Masked image modeling methods adopt inpainting training. (c) Our LDM-based model generates segmentation masks guided by visual prompts.
  • Figure 2: Latent diffusion model for in-context segmentation. (a) Previous works mainly rely on textual prompts and additional neural networks for segmentation. (b) Our proposed minimalist framework, LDIS. (c) Segmentation results on images and videos.
  • Figure 3: Our proposed LDIS. Left: Meta-architecture. Our model is minimalist and generates the mask under the guidance of in-context instructions. Right: The two variants of our meta-architecture (LDIS-1 and LDIS-n) differ in input formulation, sampling steps, and optimization target. Notations are listed in the paper's notation table.
  • Figure 4: Visualization of segmentation results. We compare our LDIS-1 with Painter and PerSAM on the COCO dataset.
  • Figure 5: Visualizations at different time steps. LDIS-n captures low-frequency components at the beginning and then generates high-frequency information as the denoising process approaches completion. The number of denoising steps is 20.
  • ...and 1 more figure
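
Figure 1(a) describes the discriminative baseline, in which query images are matched against support prototypes. The sketch below illustrates one common form of that idea and is not the authors' LDIS method. It assumes masked average pooling to build the prototype and a cosine-similarity threshold to label query positions. The function names, the toy feature maps, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
import math


def masked_average(features, mask):
    """Average the feature vectors at positions where the support mask is 1."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
    if count == 0:
        raise ValueError("support mask is empty")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prototype_segment(support_feats, support_mask, query_feats, threshold=0.5):
    """Label each query position 1 if it is similar enough to the prototype."""
    proto = masked_average(support_feats, support_mask)
    return [
        [1 if cosine(vec, proto) > threshold else 0 for vec in row]
        for row in query_feats
    ]


# Toy 2x2 feature maps with 2-dimensional features (hypothetical values).
support_feats = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
support_mask = [[1, 1], [0, 0]]  # the object occupies the top row
query_feats = [[[0.0, 1.0], [1.0, 0.1]], [[0.8, 0.2], [0.2, 0.8]]]

pred = prototype_segment(support_feats, support_mask, query_feats)
print(pred)
```

In practice the feature maps would come from a pretrained backbone. The LDM-based approach in Figure 1(c) differs in that it generates the mask directly rather than thresholding a similarity map.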