Y-MAP-Net: Real-time depth, normals, segmentation, multi-label captioning and 2D human pose in RGB images

Ammar Qammaz, Nikolaos Vasilikopoulos, Iason Oikonomidis, Antonis A. Argyros

TL;DR

Y-MAP-Net is a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. It adopts a multi-teacher, single-student training paradigm that distills the capabilities of task-specific foundation models into a lightweight architecture suitable for real-time applications.

Abstract

We present Y-MAP-Net, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. Y-MAP-Net simultaneously predicts depth, surface normals, human pose and semantic segmentation, and generates multi-label captions, all from a single network evaluation. To achieve this, we adopt a multi-teacher, single-student training paradigm in which task-specific foundation models supervise the network's learning, enabling it to distill their capabilities into a lightweight architecture suitable for real-time applications. Y-MAP-Net exhibits strong generalization, simplicity and computational efficiency, making it well suited to robotics and other practical scenarios. To support future research, we will release our code publicly.
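
As a rough illustration of the multi-teacher, single-student idea, the sketch below combines per-task losses between a student's outputs and the pseudo-labels produced by task-specific teachers. It is a minimal sketch under stated assumptions and not the paper's training code: the function names (`mse`, `multi_teacher_loss`), the mean-squared-error loss, and the per-task weights are all hypothetical, since the abstract does not specify the losses or weighting used.

```python
from typing import Dict, List

Vector = List[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_teacher_loss(
    student_outputs: Dict[str, Vector],
    teacher_targets: Dict[str, Vector],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-task losses (assumed form, not from the paper).

    Each key names a task (e.g. "depth"); the teacher for that task supplies
    the target, and the single student supplies one output per task.
    """
    total = 0.0
    for task, target in teacher_targets.items():
        # Tasks without an explicit weight default to 1.0 (an assumption).
        total += weights.get(task, 1.0) * mse(student_outputs[task], target)
    return total


# Toy example with two tasks and made-up numbers.
student = {"depth": [0.5, 0.5], "normals": [0.0, 0.0, 1.0]}
teachers = {"depth": [1.0, 0.0], "normals": [0.0, 0.0, 1.0]}
loss = multi_teacher_loss(student, teachers, {"depth": 2.0, "normals": 1.0})
```

In this toy case only the depth task contributes to the loss, because the student's normals already match the teacher's. In practice the student would be a network evaluated once per image, with each teacher run offline or in parallel to produce its targets.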

Paper Structure

This paper contains 15 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Given an RGB frame, Y-MAP-Net estimates human pose, depth, surface normals, segmentation and image captions in real time. The figure shows pose keypoints, depth and normal estimations on publicly available images from factory floors.
  • Figure 2: Architecture of the proposed Y-MAP-Net. The flow of data from input to outputs is highlighted with cyan arrows. Residual connections are indicated with + signs, while rectangles give a dimensionality overview for each layer. Green layers (top left) signify encoder blocks originating from 1 RGB image. The network bridge layers appear in orange (middle). Blue layers (bottom) depict the 8 captioning output tokens, encoded as GloVe [pennington2014glove] vectors. These can optionally be converted to multi-hot labels with the addition of 2 dense layers for some applications. Magenta layers (top right) signify decoder blocks that lead to 44 multi-modal outputs. Detailed encoder and decoder layer architecture is provided in Figure 3.
  • Figure 3: Encoder (green) and decoder (magenta) blocks of Y-MAP-Net (Fig. 2) are tasked with down-scaling input images to the bridge representation and then up-scaling to pictorial outputs. They consist of layers and skip connections shown in this figure.
  • Figure 4: Enforcing the network's output normals (right) on its depth output (left) through our iterative algorithm can improve the depth estimate (middle).
  • Figure 5: The frequency of tokens encountered while captioning an image follows a very heavy-tailed distribution. Red: the 80 most frequent words when no stop-word list [stopwords] is applied. The token 'a' appears several orders of magnitude more often than, e.g., the token 'cat'. This heavy class imbalance negatively impacts training. Green: after removing the tokens '(', ')', '.', 'a', 'an', 's', 'of', 'on', 'and', 'I', 'in', 'the', 'is', 'it', 'at', 'to', 'with', 'for' and 'from', we get the second distribution (green frame), which is more balanced.
  • ...and 1 more figure
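
The token filtering described in the Figure 5 caption can be sketched as follows. The removed-token list is copied from the caption. Everything else is an assumption made for illustration: the helper names (`filter_tokens`, `token_frequencies`), the whitespace tokenization, and the case-sensitive matching are hypothetical and may differ from the authors' preprocessing.

```python
from collections import Counter
from typing import Iterable, List

# Tokens the Figure 5 caption lists as removed before computing frequencies.
REMOVED_TOKENS = frozenset({
    "(", ")", ".", "a", "an", "s", "of", "on", "and", "I",
    "in", "the", "is", "it", "at", "to", "with", "for", "from",
})


def filter_tokens(tokens: Iterable[str]) -> List[str]:
    """Drop every token that appears in REMOVED_TOKENS (exact, case-sensitive match)."""
    return [tok for tok in tokens if tok not in REMOVED_TOKENS]


def token_frequencies(captions: Iterable[str]) -> Counter:
    """Count the remaining tokens over all captions.

    Captions are split on whitespace, which is a simplifying assumption.
    """
    counts: Counter = Counter()
    for caption in captions:
        counts.update(filter_tokens(caption.split()))
    return counts


# Toy captions, made up for this example.
captions = ["a cat on the sofa", "a dog and a cat in the park"]
counts = token_frequencies(captions)
```

On these toy captions, the very frequent token 'a' no longer appears in the counts, while content words such as 'cat' remain, which is the rebalancing effect the caption describes.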