Table of Contents
Fetching ...

Weakly supervised training of universal visual concepts for multi-domain semantic segmentation

Petra Bevandić, Marin Oršić, Ivan Grubišić, Josip Šarić, Siniša Šegvić

TL;DR

This work considers labels as unions of universal visual concepts as unions of universal visual concepts, which allows seamless and principled learning on multi-domain dataset collections without requiring any relabeling effort and improves within-dataset and cross-dataset generalization.

Abstract

Deep supervised models have an unprecedented capacity to absorb large quantities of training data. Hence, training on multiple datasets becomes a method of choice towards strong generalization in usual scenes and graceful performance degradation in edge cases. Unfortunately, different datasets often have incompatible labels. For instance, the Cityscapes road class subsumes all driving surfaces, while Vistas defines separate classes for road markings, manholes etc. Furthermore, many datasets have overlapping labels. For instance, pickups are labeled as trucks in VIPER, cars in Vistas, and vans in ADE20k. We address this challenge by considering labels as unions of universal visual concepts. This allows seamless and principled learning on multi-domain dataset collections without requiring any relabeling effort. Our method achieves competitive within-dataset and cross-dataset generalization, as well as ability to learn visual concepts which are not separately labeled in any of the training datasets. Experiments reveal competitive or state-of-the-art performance on two multi-domain dataset collections and on the WildDash 2 benchmark.

Weakly supervised training of universal visual concepts for multi-domain semantic segmentation

TL;DR

This work considers labels as unions of universal visual concepts as unions of universal visual concepts, which allows seamless and principled learning on multi-domain dataset collections without requiring any relabeling effort and improves within-dataset and cross-dataset generalization.

Abstract

Deep supervised models have an unprecedented capacity to absorb large quantities of training data. Hence, training on multiple datasets becomes a method of choice towards strong generalization in usual scenes and graceful performance degradation in edge cases. Unfortunately, different datasets often have incompatible labels. For instance, the Cityscapes road class subsumes all driving surfaces, while Vistas defines separate classes for road markings, manholes etc. Furthermore, many datasets have overlapping labels. For instance, pickups are labeled as trucks in VIPER, cars in Vistas, and vans in ADE20k. We address this challenge by considering labels as unions of universal visual concepts. This allows seamless and principled learning on multi-domain dataset collections without requiring any relabeling effort. Our method achieves competitive within-dataset and cross-dataset generalization, as well as ability to learn visual concepts which are not separately labeled in any of the training datasets. Experiments reveal competitive or state-of-the-art performance on two multi-domain dataset collections and on the WildDash 2 benchmark.
Paper Structure (29 sections, 12 equations, 16 figures, 10 tables)

This paper contains 29 sections, 12 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: Common inconsistencies among popular datasets. Images from ADE20k, Viper and Vistas (top) are labeled with discrepant granularity: class road from ADE20k and VIPER corresponds to 7 classes in Vistas: marking-other, manhole, crosswalk, road-other etc (middle). These datasets have also overlapping labels (bottom): pickups are labeled as truck in VIPER, car in Vistas, and van in ADE20k.
  • Figure 2: We address the semantic overlaps from the bottom row of Fig. \ref{['fig:intro']} by mapping each label (left) to a set of disjoint universal classes (right). Our universal taxonomy spans the entire semantic range of the considered dataset collection.
  • Figure 3: Our models allow universal inference in the wild (left) as well as multi-dataset training and validation on dataset-specific labels (right).
  • Figure 4: Construction of a universal taxonomy. We collect all dataset-specific classes into the multiset $\mathbbmsl{M}$ (left). Then, we iteratively modify $\mathbbmsl{M}$ according to the three resolution rules from section \ref{['ss:method-taxonomy']} (top-right). The iteration continues until all classes in $\mathbbmsl{M}$ are disjoint (center). Finaly, we filter universal classes that can not be trained with the available supervision (bottom-right).
  • Figure 5: We propose an extension of the M2F architecture with fixed matching that can be trained with our universal taxonomy. The model assigns pixels to universal classes according to sigmoid-activated dot products between mask embeddings and pixel-level embeddings. We recover dataset-specific masks as maximum pixel-level assignments over the associated universal classes (\ref{['eq:nll-max']}).
  • ...and 11 more figures