Towards Instance Segmentation with Polygon Detection Transformers

Jiacheng Sun, Jiaqi Lin, Wenlong Hu, Haoyang Li, Xinghong Zhou, Chenghai Mao, Yan Peng, Xiaomao Li

TL;DR

A Polygon Detection Transformer (Poly-DETR) is presented to reformulate instance segmentation as sparse vertex regression via Polar Representation, thereby eliminating the reliance on dense pixel-wise mask prediction.

Abstract

One of the bottlenecks for instance segmentation today lies in the conflicting requirements of high-resolution inputs and lightweight, real-time inference. To address this bottleneck, we present a Polygon Detection Transformer (Poly-DETR) that reformulates instance segmentation as sparse vertex regression via Polar Representation, thereby eliminating the reliance on dense pixel-wise mask prediction. To account for the box-to-polygon reference shift in Detection Transformers, we propose Polar Deformable Attention and a Position-Aware Training Scheme to dynamically update supervision and focus attention on boundary cues. Compared with state-of-the-art polar-based methods, Poly-DETR achieves a 4.7 mAP improvement on MS COCO test-dev. Moreover, we construct a parallel mask-based counterpart to support a systematic comparison between polar and mask representations. Experimental results show that Poly-DETR is more lightweight in high-resolution scenarios, reducing memory consumption by almost half on the Cityscapes dataset. Notably, on the PanNuke (cell segmentation) and SpaceNet (building footprints) datasets, Poly-DETR surpasses its mask-based counterpart on all metrics, which validates its advantage on regular-shaped instances in domain-specific settings.
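
To make "sparse vertex regression via Polar Representation" concrete, the sketch below decodes a starting point and a short list of ray lengths into polygon vertices. It is an illustration only, not the paper's implementation: the uniform angular spacing, the ray count, and the names polar_to_polygon and polygon_area are assumptions introduced here, and the abstract does not specify Poly-DETR's exact parameterization.

```python
import math


def polar_to_polygon(cx, cy, radii):
    """Decode a starting point (cx, cy) and per-ray distances into vertices.

    Assumption: rays are spaced uniformly over 360 degrees, with ray 0
    pointing along the positive x-axis.
    """
    n = len(radii)
    vertices = []
    for k, r in enumerate(radii):
        theta = 2.0 * math.pi * k / n  # angle of the k-th ray
        vertices.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return vertices


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


# Four rays of length 5 around (10, 20) give a diamond with diagonals 10 and 10.
diamond = polar_to_polygon(10.0, 20.0, [5.0, 5.0, 5.0, 5.0])
diamond_area = polygon_area(diamond)
```

Under these assumptions an instance is described by 2 + n numbers (the starting point plus n ray lengths), independent of image resolution, whereas a dense mask grows with the number of pixels it covers.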

Paper Structure (36 sections, 14 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Comparison of Mask and Polar Representations in Transformers. In (a) and (b), shared components are shown in gray. Mask-based methods require an additional branch (in blue) for high-resolution mask features, whereas Poly-DETR directly predicts polar parameters. In (c), as input resolution increases, Poly-DETR achieves lower latency and memory than the mask-based counterpart.
  • Figure 2: Two practical mismatches in DETR-style detectors. (a) The starting point shifts from its origin (green) to the optimized one (red). The static reference yields a polygon that drifts away from the contour (yellow), while the dynamic reference keeps the reconstructed polygon aligned. (b) Deformable Attention sampling locations (white crosses) and their density heatmap show that box-oriented sampling concentrates around box cues, whereas polar-oriented sampling favors boundary regions.
  • Figure 3: Overview of the proposed Poly-DETR. The upper part illustrates the query-to-polygon pipeline. The bottom-left part shows the fan-shaped attention grid in Polar Deformable Attention, and the bottom-right part presents the reference updating process in Position-Aware Training Scheme.
  • Figure 4: Motivating example of starting point selection. On the feature pyramid, high-score grid locations (orange) are selected from the center region (gray) as starting points. Their corresponding polygons are visualized on the right. Moreover, the optimal polygon with minimum Representation Error (RE) and its starting point are also visualized and marked in blue.
  • Figure 5: Illustration of Hybrid Supervision Strategy in Poly-DETR.
  • ...and 5 more figures