HAISTA-NET: Human Assisted Instance Segmentation Through Attention

Muhammed Korkmaz, T. Metin Sezgin

TL;DR

This work proposes a human-assisted segmentation model, HAISTA-NET, which augments the existing Strong Mask R-CNN network to incorporate human-specified partial boundaries and presents a dataset of hand-drawn partial object boundaries, which are referred to as human attention maps.

Abstract

Instance segmentation is a form of image detection with a range of applications, such as object refinement, medical image analysis, and image/video editing, all of which demand a high degree of accuracy. However, this precision is often beyond what even state-of-the-art, fully automated instance segmentation algorithms can deliver. The performance gap becomes particularly prohibitive for small and complex objects, so practitioners typically resort to fully manual annotation, which is laborious. To overcome this problem, we propose a novel approach that enables more precise predictions and generates higher-quality segmentation masks for high-curvature, complex, and small-scale objects. Our human-assisted segmentation model, HAISTA-NET, augments the existing Strong Mask R-CNN network to incorporate human-specified partial boundaries. We also present a dataset of hand-drawn partial object boundaries, which we refer to as human attention maps. In addition, the Partial Sketch Object Boundaries (PSOB) dataset contains hand-drawn partial object boundaries that represent the curvatures of an object's ground truth mask with several pixels. Through extensive evaluation on the PSOB dataset, we show that HAISTA-NET outperforms state-of-the-art methods such as Mask R-CNN, Strong Mask R-CNN, and Mask2Former, improving AP-Mask by 36.7, 29.6, and 26.5 points over these three models, respectively. We hope that our novel approach will set a baseline for future human-aided deep learning models by combining fully automated and interactive instance segmentation architectures.

Paper Structure (17 sections, 2 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: A first look at the precise mask prediction produced by HAISTA-NET using the human attention maps dataset (center image). HAISTA-NET outperforms the mask prediction of Strong Mask R-CNN (right) on high-curvature objects.
  • Figure 2: Method outline. Users draw a few pixels on the parts of the object that attract their attention. A human attention map is then generated and concatenated with the RGB image to form the model input. We denote the concatenation operator as $\otimes$.
  • Figure 3: Interactive interface. Users can interact with the target object via this tool.
  • Figure 4: Visualization of the Predictions of HAISTA-NET. We present images with different scales, curvature numbers, and hand-drawing-based assistance types.
  • Figure 5: The graphs demonstrate the main and interaction effects of AP$_{\text{Mask}}$ and AP$_{\text{Bbox}}$ values generated using HAISTA-NET.
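
The concatenation step described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_attention_map` and `concat_channels`, the binary encoding of the strokes, and the choice to append the map as a fourth channel are all assumptions made for illustration. The source only states that a human attention map is concatenated with the RGB image.

```python
# Hypothetical sketch: turn user-drawn pixels into a single-channel map and
# append it to an RGB image as an extra channel. Names and encoding are assumed.

def build_attention_map(height, width, stroke_pixels):
    """Rasterize (row, col) stroke pixels into a binary height x width map."""
    attention = [[0.0] * width for _ in range(height)]
    for row, col in stroke_pixels:
        # Ignore strokes that fall outside the image bounds.
        if 0 <= row < height and 0 <= col < width:
            attention[row][col] = 1.0
    return attention


def concat_channels(rgb, attention):
    """Append the attention value to each RGB pixel, giving 4 channels per pixel."""
    same_height = len(rgb) == len(attention)
    same_width = all(len(r) == len(a) for r, a in zip(rgb, attention))
    if not (same_height and same_width):
        raise ValueError("image and attention map must share height and width")
    return [
        [list(pixel) + [value] for pixel, value in zip(rgb_row, att_row)]
        for rgb_row, att_row in zip(rgb, attention)
    ]


# Toy 4 x 5 black image with three user-drawn boundary pixels.
rgb = [[(0, 0, 0)] * 5 for _ in range(4)]
strokes = [(0, 1), (1, 2), (2, 3)]
attention = build_attention_map(4, 5, strokes)
model_input = concat_channels(rgb, attention)
```

In this sketch each pixel of `model_input` holds four values: the three color channels followed by the attention value, which is 1.0 only where the user drew.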