Learning to Fuse Things and Stuff

Jie Li, Allan Raventos, Arjun Bhargava, Takaaki Tagawa, Adrien Gaidon

TL;DR

This work presents TASCNet, an end-to-end panoptic segmentation model that unifies things (instance segmentation) and stuff (semantic segmentation) on a shared ResNet+FPN backbone. It introduces a differentiable Things and Stuff Consistency (TASC) loss and a global things mask built with a RoI-Flatten operation, which together enable mask-guided fusion of the two outputs into a panoptic result in a single pass. The approach achieves competitive panoptic quality (PQ) on Cityscapes, Mapillary Vistas, and COCO with a single network and fewer parameters than separate models, showing the value of explicitly coupling the two tasks. Aligning the instance and semantic predictions also simplifies the training and inference workflows.

Abstract

We propose an end-to-end learning approach for panoptic segmentation, a novel task unifying instance (things) and semantic (stuff) segmentation. Our model, TASCNet, uses feature maps from a shared backbone network to predict in a single feed-forward pass both things and stuff segmentations. We explicitly constrain these two output distributions through a global things and stuff binary mask to enforce cross-task consistency. Our proposed unified network is competitive with the state of the art on several benchmarks for panoptic segmentation as well as on the individual semantic and instance segmentation tasks.
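
The cross-task constraint described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `roi_flatten` and `consistency_residual`, the `(top, left)` box format, the use of a per-pixel maximum for overlapping instances, and the mean squared residual are all assumptions made for illustration. The sketch uses plain Python lists and shows only the forward computation; the operation in the paper is differentiable and would live inside a deep learning framework.

```python
def roi_flatten(instance_masks, boxes, height, width):
    """Paste per-instance mask probabilities into one full-image things mask.

    instance_masks: list of 2D lists (h_i x w_i) with values in [0, 1].
    boxes: list of (top, left) offsets, one per instance (hypothetical format).
    """
    canvas = [[0.0] * width for _ in range(height)]
    for mask, (top, left) in zip(instance_masks, boxes):
        for dy, row in enumerate(mask):
            for dx, p in enumerate(row):
                y, x = top + dy, left + dx
                if 0 <= y < height and 0 <= x < width:
                    # Assumption: where instances overlap, keep the larger value.
                    canvas[y][x] = max(canvas[y][x], p)
    return canvas


def consistency_residual(things_from_stuff_head, things_from_instances):
    """Mean squared per-pixel disagreement between the two things masks."""
    total, count = 0.0, 0
    for row_a, row_b in zip(things_from_stuff_head, things_from_instances):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


# One 1x2 instance mask placed at row 0, column 1 of a 2x4 image.
flat = roi_flatten([[[1.0, 1.0]]], [(0, 1)], height=2, width=4)
# The stuff head (hypothetically) marks only one of those two pixels as things.
stuff_view = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(consistency_residual(stuff_view, flat))  # 0.125 (1 of 8 pixels disagrees)
```

A residual of zero means the two heads agree on which pixels belong to things; minimizing a penalty of this kind during training is what pushes the heads toward consistent outputs.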

Paper Structure

This paper contains 14 sections, 1 equation, 7 figures, 6 tables.

Figures (7)

  • Figure 1: We propose an end-to-end architecture for panoptic segmentation. Our model predicts things and stuff with a shared backbone and an internal mask enforcing Things and Stuff Consistency (TASC) that can be used to guide fusion.
  • Figure 2: TASCNet: Our unified architecture jointly predicts things, stuff, and a fusion mask. The proposed heads are built on top of a ResNet + FPN backbone. The Stuff Head uses fully convolutional layers to densely predict all stuff classes and an additional things mask. The Things Head uses region-based CNN layers for instance detection and segmentation. Between these two prediction heads, we propose a Things and Stuff Consistency (TASC) loss to ensure alignment between the predictions.
  • Figure 3: RoI-Flatten. We propose a differentiable operation that merges individual proposal masks into a single binary mask, providing a global constraint across tasks.
  • Figure 4: Residual Example. Example image of residuals from a model trained without TASC (left) and a model trained with TASC (right).
  • Figure 5: Panoptic segmentation examples from Cityscapes and Mapillary Vistas. In the panoptic segmentation results, each instance is drawn in a slight variation of the base color of its semantic class. In the matched-segment views, true positive segments are marked in white, while false positive and false negative segments are marked in black.
  • ...and 2 more figures
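
The true positive, false positive, and false negative segments mentioned for Figure 5 are the ingredients of the panoptic quality (PQ) metric. As commonly defined for panoptic segmentation, a predicted segment matches a ground-truth segment when their IoU exceeds 0.5, and PQ is the sum of matched IoUs divided by TP + 0.5·FP + 0.5·FN. The sketch below computes this for a single class. Representing segments as sets of pixel ids and the names `iou` and `panoptic_quality` are assumptions for illustration; a real evaluation also averages over classes and handles void regions, which are omitted here.

```python
def iou(a, b):
    """Intersection over union of two segments given as sets of pixel ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(predicted, ground_truth):
    """PQ for one class: sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN)."""
    matched_gt = set()
    iou_sum, tp = 0.0, 0
    for pred in predicted:
        for idx, gt in enumerate(ground_truth):
            if idx in matched_gt:
                continue
            score = iou(pred, gt)
            # With non-overlapping segments, IoU > 0.5 makes each match unique.
            if score > 0.5:
                matched_gt.add(idx)
                iou_sum += score
                tp += 1
                break
    fp = len(predicted) - tp      # predictions with no ground-truth match
    fn = len(ground_truth) - tp   # ground-truth segments left unmatched
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


predicted = [{1, 2, 3, 4}, {10, 11}]
ground_truth = [{1, 2, 3}, {20, 21}]
# One match with IoU 3/4, one false positive, one false negative.
print(panoptic_quality(predicted, ground_truth))  # 0.375
```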