
BridgeNet: Comprehensive and Effective Feature Interactions via Bridge Feature for Multi-task Dense Predictions

Jingdong Zhang, Jiayuan Fan, Peng Ye, Bo Zhang, Hancheng Ye, Baopu Li, Yancheng Cai, Tao Chen

TL;DR

This work proposes BridgeNet, a framework that extracts comprehensive and discriminative intermediate Bridge Features and conducts cross-task interactions based on them. To the best of the authors' knowledge, it is the first work to consider the completeness and quality of feature participants in cross-task interactions.

Abstract

Multi-task dense prediction aims to handle multiple pixel-wise prediction tasks simultaneously within a unified network for visual scene understanding. However, cross-task feature interactions in current methods still suffer from incomplete levels of representation, less discriminative semantics in the feature participants, and inefficient pair-wise task interaction processes. To tackle these under-explored issues, we propose a novel BridgeNet framework, which extracts comprehensive and discriminative intermediate Bridge Features and conducts interactions based on them. Specifically, a Task Pattern Propagation (TPP) module is first applied to ensure that highly semantic task-specific feature participants are prepared for subsequent interactions, and a Bridge Feature Extractor (BFE) is designed to selectively integrate both high-level and low-level representations to generate the comprehensive bridge features. Then, instead of conducting heavy pair-wise cross-task interactions, a Task-Feature Refiner (TFR) efficiently takes guidance from the bridge features and forms the final task predictions. To the best of our knowledge, this is the first work to consider the completeness and quality of feature participants in cross-task interactions. Extensive experiments on the NYUD-v2, Cityscapes and PASCAL Context benchmarks show superior performance, indicating that the proposed architecture is effective in promoting different dense prediction tasks simultaneously.
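
The efficiency argument in the abstract (pair-wise cross-task interactions versus guidance from bridge features) can be illustrated by counting interaction steps. This is a minimal sketch, not the authors' code: the function names are hypothetical, and it assumes that a pair-wise scheme lets every task interact with every other task, while a bridge-based scheme needs one refinement step per task at a given scale.

```python
def pairwise_interactions(n_tasks: int) -> int:
    """Directed task-to-task interactions when every task
    exchanges features with every other task: n * (n - 1)."""
    return n_tasks * (n_tasks - 1)


def bridge_interactions(n_tasks: int) -> int:
    """Assumed bridge-based scheme: each task is refined once
    by the bridge feature, so the count grows linearly."""
    return n_tasks


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"tasks={n}: pair-wise={pairwise_interactions(n)}, "
              f"bridge={bridge_interactions(n)}")
```

Under these assumptions the pair-wise count grows quadratically with the number of tasks and the bridge-based count grows linearly.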

Paper Structure (18 sections, 9 equations, 10 figures, 8 tables)


Figures (10)

  • Figure 1: Illustration of multi-task interaction strategies. (a) Encoder-focused [misra2016cross, liu2019end, gao2019nddr, lu2017fully, vandenhende2019branched, guo2020learning, bruggemann2020automated]: the task-specific features interact directly with the task-generic features from the common backbone. The task-generic features contain rich low-level representations but lack high-level representations. The interaction complexity is $O(n)$. (b) Decoder-focused [xu2018pad, crawshaw2020multi, bruggemann2021exploring, zhang2019pattern, zhou2020pattern, ye2022inverted]: the task-specific features are first produced by deep supervision, and interactions are then conducted only on task-specific features, which have discriminative high-level representations but lack low-level representations. The pair-wise interaction complexity is $O(n^{2})$. (c) BridgeNet (ours): the bridge features are produced from both task-generic and task-specific features and have rich high-level and low-level representations. The subsequent interactions based on bridge features have only $O(n)$ complexity. The right part of the legend shows examples of representations from different levels in the feature map. The low-level representations contain less discriminative task information but are rich in details such as boundaries, corners and textures. In contrast, the high-level representations are smooth, with fewer image details, but are highly correlated with task properties, such as the highlighted floor in semantic segmentation or areas farther from the depth sensor.
  • Figure 2: Overview of our proposed method. We take depth estimation and semantic segmentation as examples. A shared image backbone encodes task-generic features, and a set of preliminary decoders with deep supervision extracts task-specific features. Unlike decoder-focused methods, a TPP is added before the initial predictions are formed to tackle the task-pattern-entanglement issue. The resulting high-quality task-specific features are not used for interaction directly; they are first processed by the BFE at each scale to gain low-level representations and form the bridge features. Then, the multi-scale bridge features are used to gradually refine the task-specific features via the TFR. The outputs of the TFR are aggregated and fed into task-specific heads for final predictions. The detailed structures of TPP (Sec. \ref{sec:S-MSA}, Fig. 4 (a)), BFE (Sec. \ref{sec:DS}, Fig. 5) and TFR (Sec. \ref{sec:CMSAFPN}, Fig. 4 (b)) are given later.
  • Figure 3: Left: Visualization of the different patterns of two tasks (semantic segmentation and depth estimation). The semantic feature focuses on objects with various semantic information, like cabinets and floors, while depth attention focuses more on surface and edge features with geometric information. Right: Visualization of the task-specific features with (w.) and without (w.o.) TPP. The task-generic features produced by the shared encoder exhibit the task-pattern-entanglement issue: their distributions differ significantly from those of the task labels, and the patterns they contain are implicit and ambiguous. Without TPP, the decomposed task-specific features that serve as interaction participants still lack discriminative semantics, which negatively affects the subsequent interaction process (BFE, TFR) and eventually produces low-quality interaction outputs. With TPP, the task-specific features obtain well-decomposed representations whose distributions are closer to their ground-truth labels (like the highlighted floor area in semantic segmentation and door area in depth estimation), and thus boost the subsequent interaction process to produce high-quality discriminative features.
  • Figure 4: (a) The structure of TPP, which is applied after the last backbone layer. The propagation attention map shares the patterns for each task. (b) The Task-Feature Refiner (TFR) employs a cascaded structure that can be flexibly deployed within the decoding process to inject rich representations from bridge features into each task-specific feature. TFRs at different scales have similar structures; the structure at one particular scale is illustrated here.
  • Figure 5: A zoomed-in view of the one-scale BFE in Fig. 2 (different scales have similar structures). The input task-generic and task-specific features are first transformed into patch tokens. The task-generic tokens then query all of the task-specific tokens globally to produce a global correlation map, which helps select useful high-level information to gain high-level representations.
  • ...and 5 more figures
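
The token querying described in the Figure 5 caption (task-generic tokens querying all task-specific tokens to form a global correlation map) can be sketched as a cross-attention step. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function name `global_correlation`, the scaled dot-product softmax, and the random projection matrices standing in for learned layers are all hypothetical choices.

```python
import numpy as np


def global_correlation(generic_tokens, task_tokens, seed=0):
    """Cross-attention sketch.

    generic_tokens: (N, C) patch tokens from the task-generic feature.
    task_tokens:    (T, N, C) patch tokens from T task-specific features.
    Returns the correlation map (N, T*N) and the selected features (N, C).
    """
    n, c = generic_tokens.shape
    t = task_tokens.shape[0]
    rng = np.random.default_rng(seed)
    # Hypothetical random projections in place of learned linear layers.
    w_q, w_k, w_v = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))

    q = generic_tokens @ w_q                  # queries: (N, C)
    all_task = task_tokens.reshape(t * n, c)  # tokens of all tasks: (T*N, C)
    k = all_task @ w_k                        # keys:    (T*N, C)
    v = all_task @ w_v                        # values:  (T*N, C)

    # Scaled dot-product scores, then a numerically stable row-wise softmax.
    scores = q @ k.T / np.sqrt(c)             # (N, T*N)
    scores = scores - scores.max(axis=1, keepdims=True)
    corr = np.exp(scores)
    corr = corr / corr.sum(axis=1, keepdims=True)

    selected = corr @ v                       # weighted task information: (N, C)
    return corr, selected


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    generic = rng.standard_normal((16, 8))    # 16 patches, 8 channels
    tasks = rng.standard_normal((2, 16, 8))   # e.g. segmentation and depth
    corr, selected = global_correlation(generic, tasks)
    print(corr.shape, selected.shape)
```

Each row of the correlation map is a distribution over the tokens of all tasks, so every task-generic patch draws a weighted mix of task-specific information regardless of spatial position.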