MP-Former: Mask-Piloted Transformer for Image Segmentation

Hao Zhang, Feng Li, Huaizhe Xu, Shijia Huang, Shilong Liu, Lionel M. Ni, Lei Zhang

TL;DR

MP-Former introduces a mask-piloted training scheme to address inconsistent mask predictions between consecutive decoder layers in Mask2Former. During training, it feeds noised ground-truth masks as attention masks and ground-truth class embeddings as extra decoder queries (the MP part), and trains the model to reconstruct the original masks. This strengthens cross-layer consistency and improves instance, panoptic, and semantic segmentation. Theoretical analysis supports increased matching stability and gradient fidelity, and empirical results show substantial gains (e.g., $+2.3$ AP on Cityscapes instance segmentation and $+1.6$ mIoU on Cityscapes semantic segmentation with a ResNet-50 backbone). The method also converges faster, outperforming Mask2Former on ADE20K with half the training epochs for both ResNet-50 and Swin-L backbones. It adds little computation during training and none during inference, and the authors state that code will be released.

Abstract

We present a mask-piloted Transformer which improves masked-attention in Mask2Former for image segmentation. The improvement is based on our observation that Mask2Former suffers from inconsistent mask predictions between consecutive decoder layers, which leads to inconsistent optimization goals and low utilization of decoder queries. To address this problem, we propose a mask-piloted training approach, which additionally feeds noised ground-truth masks in masked-attention and trains the model to reconstruct the original ones. Compared with the predicted masks used in masked-attention, the ground-truth masks serve as a pilot and effectively alleviate the negative impact of inaccurate mask predictions in Mask2Former. Based on this technique, our MP-Former achieves a remarkable performance improvement on all three image segmentation tasks (instance, panoptic, and semantic), yielding $+2.3$ AP and $+1.6$ mIoU on the Cityscapes instance and semantic segmentation tasks with a ResNet-50 backbone. Our method also significantly speeds up training, outperforming Mask2Former with half the number of training epochs on ADE20K with both ResNet-50 and Swin-L backbones. Moreover, our method introduces only a little computation during training and no extra computation during inference. Our code will be released at https://github.com/IDEA-Research/MP-Former.

Paper Structure (19 sections, 15 equations, 3 figures, 8 tables)

Figures (3)

  • Figure 1: We visualize layer-wise predictions of Mask2Former and show $4$ pairs of failure cases. Each pair shows the predictions of the same query in adjacent decoder layers. The red regions are the predicted masks. These cases show that the predictions of a query may change dramatically between consecutive layers.
  • Figure 2: A comparison of the Transformer decoders of MP-Former and Mask2Former. MP-Former feeds extra queries and attention masks, which are marked by dashed lines. The queries are the class embeddings of the GT categories and the attention masks are the GT masks. Since we adopt a multi-layer mask guide, we feed GT masks as attention masks not only in the first layer but also in subsequent layers. However, we do not feed class embeddings in multiple layers.
  • Figure 3: The architecture of our method is the same as Mask2Former (the blue-shaded part), which consists of a backbone, a pixel decoder, and a Transformer decoder. The difference is that we feed extra queries and attention masks, called the MP part, to the Transformer decoder (the red-line part in the figure). The MP part contains GT masks as attention masks and GT class embeddings as queries. We feed GT masks into the MP part of all decoder layers. We also add point noises to GT masks and flipping noises to class embeddings, which further improves performance. Note that this architecture is used only for training. At inference time, the red-line part does not exist, and thus our pipeline is exactly the same as Mask2Former.
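
The MP part described in the Figure 3 caption can be illustrated with a small, framework-free sketch. It is not the authors' implementation. The function names (`add_point_noise`, `flip_labels`, `build_mp_part`), the noise ratios, the choice to flip a label to a different class, and the convention that `True` in an attention mask means "blocked" are all assumptions made for illustration. Masks are plain nested lists of 0/1 so that the example runs with the standard library only.

```python
import random


def add_point_noise(mask, point_ratio, rng):
    """Flip a fixed fraction of randomly chosen pixels in a binary mask.

    `mask` is a list of rows of 0/1. The input is not modified.
    """
    h, w = len(mask), len(mask[0])
    n_flip = int(round(point_ratio * h * w))
    noised = [row[:] for row in mask]
    for idx in rng.sample(range(h * w), n_flip):
        r, c = divmod(idx, w)
        noised[r][c] = 1 - noised[r][c]
    return noised


def flip_labels(labels, flip_prob, num_classes, rng):
    """With probability `flip_prob`, replace a label by a different class.

    Assumption: a flipped label is drawn uniformly from the other classes.
    """
    flipped = []
    for y in labels:
        if rng.random() < flip_prob:
            flipped.append(rng.choice([c for c in range(num_classes) if c != y]))
        else:
            flipped.append(y)
    return flipped


def to_attention_mask(mask):
    """Assumed convention: True marks positions a query may NOT attend to."""
    return [[v == 0 for v in row] for row in mask]


def build_mp_part(gt_masks, gt_labels, num_classes,
                  point_ratio=0.2, flip_prob=0.2, seed=0):
    """Build training-only MP inputs from ground truth (hypothetical helper).

    Returns noised attention masks, noised class ids for the extra queries,
    and the clean (mask, label) targets the model should reconstruct.
    """
    rng = random.Random(seed)
    noised = [add_point_noise(m, point_ratio, rng) for m in gt_masks]
    attn_masks = [to_attention_mask(m) for m in noised]
    query_labels = flip_labels(gt_labels, flip_prob, num_classes, rng)
    targets = list(zip(gt_masks, gt_labels))
    return attn_masks, query_labels, targets


if __name__ == "__main__":
    gt_mask = [[1, 1, 0, 0],
               [1, 1, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]]
    attn, labels, targets = build_mp_part([gt_mask], [3], num_classes=5)
    print(labels, sum(v for row in attn[0] for v in row), "blocked positions")
```

In this sketch the noised masks and labels are only inputs; the reconstruction targets remain the clean ground truth, which matches the abstract's description of training the model to reconstruct the original masks. Since these inputs exist only during training, nothing here would run at inference time.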