Learning to Fuse Things and Stuff
Jie Li, Allan Raventos, Arjun Bhargava, Takaaki Tagawa, Adrien Gaidon
TL;DR
This work presents TASCNet, an end-to-end panoptic segmentation model that unifies things (instance segmentation) and stuff (semantic segmentation) on a shared ResNet+FPN backbone. It introduces a differentiable Things and Stuff Consistency (TASC) loss computed on a global things mask built with RoI-Flatten, and uses mask-guided fusion to assemble the panoptic output in a single pass. The approach achieves competitive Panoptic Quality (PQ) on Cityscapes, Mapillary Vistas, and COCO with a single network and fewer parameters than separate models, which demonstrates the value of explicit cross-task coupling for joint scene understanding. Aligning the instance and semantic predictions in this way also simplifies the training and inference workflows.
Abstract
We propose an end-to-end learning approach for panoptic segmentation, a novel task unifying instance (things) and semantic (stuff) segmentation. Our model, TASCNet, uses feature maps from a shared backbone network to predict both things and stuff segmentations in a single feed-forward pass. We explicitly constrain these two output distributions through a global things-and-stuff binary mask to enforce cross-task consistency. Our proposed unified network is competitive with the state of the art on several benchmarks for panoptic segmentation as well as on the individual semantic and instance segmentation tasks.
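To make the cross-task constraint concrete, the following is a minimal NumPy sketch of one way such a consistency term could be computed. It is not the authors' implementation. The function names, the box format `(y0, x0, y1, x1)`, the max-merge of overlapping instances, and the mean-squared penalty are all assumptions made for illustration; the abstract only states that the two outputs are constrained through a global things and stuff binary mask.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def things_prob_from_semantic(sem_logits, thing_class_ids):
    """Per-pixel probability of 'any thing class' from the semantic head.

    sem_logits: array of shape (num_classes, H, W).
    thing_class_ids: list of class indices treated as things (assumed split).
    """
    probs = softmax(sem_logits, axis=0)
    return probs[thing_class_ids].sum(axis=0)


def things_prob_from_instances(mask_probs, boxes, image_shape):
    """Paste per-instance mask probabilities into an image-sized canvas.

    mask_probs: list of 2-D arrays, each already resized to its box.
    boxes: list of (y0, x0, y1, x1) integer boxes (hypothetical format).
    Overlapping instances are merged with a per-pixel max (an assumption).
    """
    canvas = np.zeros(image_shape, dtype=np.float64)
    for mask, (y0, x0, y1, x1) in zip(mask_probs, boxes):
        region = canvas[y0:y1, x0:x1]
        canvas[y0:y1, x0:x1] = np.maximum(region, mask)
    return canvas


def consistency_loss(sem_things, inst_things):
    """Mean squared disagreement between the two global things masks."""
    return float(np.mean((sem_things - inst_things) ** 2))


# Toy usage: 3 classes (0 = stuff, 1 and 2 = things) on a 4x4 image.
sem_logits = np.zeros((3, 4, 4))
sem_things = things_prob_from_semantic(sem_logits, [1, 2])
inst_things = things_prob_from_instances(
    [np.ones((2, 2))], [(0, 0, 2, 2)], (4, 4)
)
loss = consistency_loss(sem_things, inst_things)
```

The loss is zero only when both heads agree on which pixels belong to things, so minimizing it alongside the per-task losses would push the two output distributions toward each other. In a real training setup the same computation would be written in an autodiff framework so that gradients reach both heads.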
