MGD-SAM2: Multi-view Guided Detail-enhanced Segment Anything Model 2 for High-Resolution Class-agnostic Segmentation

Haoran Shen, Peixian Zhuang, Jiahao Kou, Yuxin Zeng, Haoying Xu, Jiangyun Li

TL;DR

High-resolution class-agnostic segmentation (HRCS) remains challenging for vision foundation models such as SAM because it requires both fine-grained detail and global context. The paper introduces MGD-SAM2, which adapts SAM2 to HRCS by processing a globally resized image together with four local patches through four modules: MPAdapter, MCEM, HMIM, and DRM. A cross-view, multi-scale fusion strategy, trained with a BCE+IoU objective and auxiliary supervision on $M_s$, yields state-of-the-art results on DIS5K, HRSOD, UHRSD/DAVIS-S, and SOD datasets while keeping the number of trainable parameters small. The approach generalizes across diverse HRCS tasks and datasets, offering a practical route to accurate high-resolution class-agnostic segmentation with SAM2 as a prior.
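
As a rough illustration of the BCE+IoU objective mentioned above, the following minimal sketch combines binary cross-entropy with a soft IoU term and applies it to both the auxiliary prediction $M_s$ and the final prediction $M_p$. The function names, the flat-list mask representation, the `eps` smoothing, and the `aux_weight` parameter are assumptions made for illustration; the summary does not state how the two supervised terms are weighted.

```python
import math


def bce_iou_loss(pred, target, eps=1e-7):
    """BCE + soft-IoU loss over flat lists of probabilities and 0/1 labels."""
    # Binary cross-entropy averaged over pixels; clamp to avoid log(0).
    bce = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        bce -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    bce /= len(pred)

    # Soft IoU: intersection and union are computed on probabilities.
    inter = sum(p * g for p, g in zip(pred, target))
    union = sum(p + g - p * g for p, g in zip(pred, target))
    iou = 1.0 - (inter + eps) / (union + eps)
    return bce + iou


def total_loss(m_s, m_p, gt, aux_weight=1.0):
    """Supervise both the final mask M_p and the auxiliary mask M_s.

    aux_weight is a hypothetical knob; equal weighting is an assumption.
    """
    return bce_iou_loss(m_p, gt) + aux_weight * bce_iou_loss(m_s, gt)
```

In this sketch a perfect prediction drives both terms toward zero, while an uninformative prediction of 0.5 everywhere is penalized by both the cross-entropy and the IoU term.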

Abstract

Segment Anything Models (SAMs), as vision foundation models, have demonstrated remarkable performance across various image analysis tasks. Despite their strong generalization capabilities, SAMs struggle with fine-grained detail in high-resolution class-independent segmentation (HRCS) because they are limited in directly processing high-resolution inputs, predict low-resolution masks, and rely on accurate manual prompts. To address these limitations, we propose MGD-SAM2, which integrates SAM2 with multi-view feature interaction between a global image and local patches to achieve precise segmentation. MGD-SAM2 augments the pre-trained SAM2 with four novel modules: the Multi-view Perception Adapter (MPAdapter), the Multi-view Complementary Enhancement Module (MCEM), the Hierarchical Multi-view Interaction Module (HMIM), and the Detail Refinement Module (DRM). Specifically, we first introduce MPAdapter to adapt the SAM2 encoder for enhanced extraction of local details and global semantics in HRCS images. Then, MCEM and HMIM further exploit local texture and global context by aggregating multi-view features within and across scales. Finally, DRM generates gradually restored high-resolution mask predictions, compensating for the fine-grained details lost when low-resolution prediction maps are upsampled directly. Experimental results demonstrate the superior performance and strong generalization of our model on multiple high-resolution and normal-resolution datasets. Code will be available at https://github.com/sevenshr/MGD-SAM2.
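
The multi-view input described above (one global image plus local patches) could be built along the following lines. This is only a sketch under stated assumptions: the four patches are taken from a non-overlapping 2x2 grid, the global view is produced by nearest-neighbour downsampling to the patch size, and the image is a plain list of rows with even height and width. The function name `make_multiview_input` is hypothetical, and the actual patching and resizing scheme may differ.

```python
def make_multiview_input(image):
    """Return (global_view, patches) for a 2D image given as a list of rows."""
    h, w = len(image), len(image[0])
    ph, pw = h // 2, w // 2  # patch size; assumes even height and width

    # Four local patches from a 2x2 grid, in the order
    # top-left, top-right, bottom-left, bottom-right.
    patches = [
        [row[x:x + pw] for row in image[y:y + ph]]
        for y in (0, ph)
        for x in (0, pw)
    ]

    # Global view: nearest-neighbour downsampling by a factor of 2, so the
    # whole image is seen at the same resolution as each local patch.
    global_view = [row[::2][:pw] for row in image[::2][:ph]]
    return global_view, patches
```

All five views share the same spatial size in this sketch, so they can be stacked into a single batch for one encoder. The patches preserve local detail, while the downsampled view retains the global layout.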

Paper Structure

This paper contains 29 sections, 11 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Comparison between the proposed MGD-SAM2 and other existing methods. (a) Common framework \cite{romera2017erfnet, zhang2021looking, wang2022dual}; (b) Image pyramid-based methods \cite{xie2022pyramid, kim2022revisiting}; (c) Patch-based methods \cite{zeng2019towards, yu2024multi}; (d) SAMs' framework \cite{kirillov2023segment, ravi2024sam}; (e) Adapter-based SAMs \cite{chen2023sam, gao2024multi, chen2024sam2}; (f) MGD-SAM2: leveraging multi-view complementarity with SAM2's rich prior knowledge for high-resolution class-agnostic segmentation.
  • Figure 2: Overall architecture of the proposed MGD-SAM2, which integrates the pre-trained SAM2 with four novel modules: Multi-view Perception Adapter (MPAdapter), Multi-view Complementary Enhancement Module (MCEM), Hierarchical Multi-view Interaction Module (HMIM), and Detail Refinement Module (DRM). We take the combination of image patches and the resized raw image as the multi-view input. First, we employ the MPAdapter-assisted SAM2 encoder to extract multi-stage features suitable for the HRCS task. Then, MCEM and HMIM exploit multi-scale and multi-view features to further enhance global semantics and local texture. Finally, DRM generates the gradually restored high-resolution prediction. Both the upsampled mask prediction $M_s$ from the SAM2 decoder and the final mask prediction $M_p$ are supervised. Each module is detailed in Sections \ref{sec_mpadapter}, \ref{sec_mcem}, \ref{sec_hmim}, and \ref{sec_drm}.
  • Figure 3: Details of the MPAdapter. To adapt SAM2 to the HRCS task effectively and efficiently, we insert the trainable MPAdapter into each Transformer block of SAM2's frozen encoder. Building on the Adapter, MPAdapter uses two lightweight 3D depth-wise convolutions to enhance local and global features, respectively.
  • Figure 4: Illustration of the MCEM. After the Two Way Transformer, we employ MCEM to further refine the local and global localization of deep feature $F_{16}$, which utilizes cross-attention to establish the long-range connections between multi-view features.
  • Figure 5: Illustration of the HMIM. Before directly using the deep feature $E_{16}$ and shallow features $\{F_4,F_8\}$ to obtain SAM2's mask prediction, we devise HMIM to leverage multi-scale and multi-view features to improve the local detail and global context of shallow features.
  • ...and 6 more figures
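
The Figure 4 caption says MCEM uses cross-attention to establish long-range connections between multi-view features. The sketch below shows the generic mechanism only: single-head scaled dot-product cross-attention in which global tokens attend to local tokens and local tokens attend to global tokens, each followed by a residual addition. The absence of learned projections, the bidirectional arrangement, and the names `cross_attention` and `fuse_views` are assumptions for illustration and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def fuse_views(global_tokens, local_tokens):
    """Let each view attend to the other, then add the result residually."""
    g_att = cross_attention(global_tokens, local_tokens, local_tokens)
    l_att = cross_attention(local_tokens, global_tokens, global_tokens)
    fused_global = [[a + b for a, b in zip(t, u)] for t, u in zip(global_tokens, g_att)]
    fused_local = [[a + b for a, b in zip(t, u)] for t, u in zip(local_tokens, l_att)]
    return fused_global, fused_local
```

Because every query attends to all tokens of the other view, a local patch token can draw on context from anywhere in the global image, which is the long-range connection the caption refers to.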