Multi-view Aggregation Network for Dichotomous Image Segmentation

Qian Yu, Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu

TL;DR

This work tackles high-precision dichotomous image segmentation (DIS) by reframing it as a multi-view perception problem. It introduces MVANet, a single-stream, single-stage architecture that processes a distant global view and multiple close-up views within a unified encoder–decoder, guided by two novel modules: multi-view complementary localization (MCLM) and multi-view complementary refinement (MCRM). A view rearrangement module fuses global and local information to produce a high-resolution segmentation, while a multi-objective loss provides strong supervision across stages. Empirically, MVANet achieves state-of-the-art results on the DIS-5K benchmark with superior accuracy and faster inference, demonstrating the practical benefit of integrating global context with detailed local cues in a compact architecture.

Abstract

Dichotomous Image Segmentation (DIS) has recently emerged as a task aimed at high-precision object segmentation from high-resolution natural images. When designing an effective DIS model, the main challenge is how to balance the semantic dispersion of high-resolution targets in the small receptive field and the loss of high-precision details in the large receptive field. Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement. The human visual system captures regions of interest by observing them from multiple views. Inspired by this, we model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet), which unifies the feature fusion of the distant view and close-up view into a single stream with one encoder-decoder structure. With the help of the proposed multi-view complementary localization and refinement modules, our approach establishes long-range, profound visual interactions across multiple views, allowing the features of the detailed close-up view to focus on highly slender structures. Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed. The source code and datasets will be publicly available at https://github.com/qianyu-dlut/MVANet.

Paper Structure

This paper contains 16 sections, 12 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Process of decomposing a high-resolution image into a multi-view patch sequence.
  • Figure 2: Overall framework of the proposed MVANet. The downsampled original image and non-overlapping local patches are adopted as inputs for the global context and detailed cues, representing distant and close-up views, respectively. To enhance object localization and achieve detailed depiction, we propose a multi-view complementary localization module (MCLM) and a multi-view complementary refinement module (MCRM), respectively. In addition, a view rearrangement module is introduced to integrate multiple views, thereby generating predictions with highly accurate dominant areas while preserving detailed object structures. The red dashed box indicates the location that is deeply supervised.
  • Figure 3: Pipeline of the proposed multi-view complementary localization and refinement modules. represents the multi-granularity pooling operation.
  • Figure 4: Visual comparison of different DIS methods.
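
The multi-view decomposition described in the captions of Figures 1 and 2 (a downsampled global view plus non-overlapping local patches, later integrated by a view rearrangement step) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `split_into_views` and `rearrange_patches`, the 2×2 grid, and the strided subsampling used as a stand-in for image resizing are all assumptions made for illustration, and the rearrangement here is reduced to plain tiling of patches rather than the learned feature fusion the paper describes.

```python
import numpy as np


def split_into_views(image: np.ndarray, grid: int = 2):
    """Split an H x W x C image into one distant view and grid*grid close-up views.

    The distant (global) view is the image subsampled by a factor of `grid`
    (an assumed stand-in for proper resizing). The close-up views are
    non-overlapping patches in row-major order, each the same size as the
    global view.
    """
    h, w = image.shape[:2]
    if h % grid or w % grid:
        raise ValueError("height and width must be divisible by grid")
    ph, pw = h // grid, w // grid
    global_view = image[::grid, ::grid]
    patches = [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(grid)
        for c in range(grid)
    ]
    return global_view, np.stack(patches)


def rearrange_patches(patches: np.ndarray, grid: int = 2) -> np.ndarray:
    """Tile row-major patches back into a single full-resolution array."""
    rows = [
        np.concatenate(list(patches[r * grid:(r + 1) * grid]), axis=1)
        for r in range(grid)
    ]
    return np.concatenate(rows, axis=0)


# Hypothetical 8 x 8 RGB input, used only to show the shapes involved.
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
global_view, patches = split_into_views(image, grid=2)

# All views share one spatial size, so they can be stacked into a single
# batch and passed through one encoder-decoder stream.
multi_view_batch = np.concatenate([patches, global_view[None]], axis=0)
restored = rearrange_patches(patches, grid=2)
```

Because every view has the same spatial size, the local patches and the global view fit in one batch, which is what makes a single-stream design possible; tiling the patches in the same row-major order recovers the original layout.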