FreDSNet: Joint Monocular Depth and Semantic Segmentation with Fast Fourier Convolutions

Bruno Berenguel-Baeta, Jesus Bermudez-Cameo, Jose J. Guerrero

TL;DR

FreDSNet addresses monocular depth estimation and semantic segmentation from a single equirectangular panorama by introducing fast Fourier convolutions to expand the receptive field and capture global context. The model employs an encoder-decoder with Fourier-augmented blocks and two task-specific branches, trained jointly with a composite loss that includes segmentation, depth, and auxiliary terms to enhance depth range and object boundaries. Empirical results on Stanford2D3DS show performance on par with state-of-the-art methods for both tasks, while enabling real-time inference (~33 FPS) and providing rich scene representations for navigation and AR/VR applications. This work demonstrates the value of frequency-domain convolutions for omnidirectional scene understanding and highlights the benefits of joint depth-segmentation learning in panoramic imagery.

Abstract

In this work we present FreDSNet, a deep learning solution that obtains a semantic 3D understanding of indoor environments from single panoramas. Omnidirectional images offer task-specific advantages for scene understanding problems because they provide 360-degree contextual information about the entire environment. However, the inherent characteristics of omnidirectional images make it harder to obtain accurate object detection and segmentation or good depth estimation. To overcome these problems, we exploit convolutions in the frequency domain, obtaining a wider receptive field in each convolutional layer. These convolutions make it possible to leverage the whole context information of omnidirectional images. FreDSNet is the first network that jointly provides monocular depth estimation and semantic segmentation from a single panoramic image by exploiting fast Fourier convolutions. Our experiments show that FreDSNet performs similarly to task-specific state-of-the-art methods for semantic segmentation and depth estimation. FreDSNet code is publicly available at https://github.com/Sbrunoberenguel/FreDSNet
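The claim that frequency-domain convolutions widen the receptive field can be illustrated with a small NumPy sketch. It is not the FreDSNet implementation: the function names, the random per-frequency weights (standing in for learned parameters) and the 3x3 mean filter used as a local baseline are all assumptions made for illustration. The sketch perturbs one input pixel and counts how many output pixels change under each operator.

```python
import numpy as np

def spectral_transform(x, weight):
    """Scale each frequency of x (H x W) by weight (H x (W // 2 + 1))."""
    spec = np.fft.rfft2(x)                  # real FFT over both spatial axes
    spec = spec * weight                    # per-frequency scaling (assumed stand-in for learned weights)
    return np.fft.irfft2(spec, s=x.shape)   # back to the spatial domain

def box_filter_3x3(x):
    """3x3 mean filter with circular padding, used as a local baseline."""
    out = np.zeros_like(x)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += np.roll(x, (dy, dx), axis=(0, 1))
    return out / 9.0

rng = np.random.default_rng(0)
h, w = 16, 32
x = rng.standard_normal((h, w))
weight = rng.standard_normal((h, w // 2 + 1))  # hypothetical weights

x_perturbed = x.copy()
x_perturbed[0, 0] += 1.0  # change a single input pixel

# Count output pixels that react to the single-pixel change.
spectral_changed = np.abs(
    spectral_transform(x_perturbed, weight) - spectral_transform(x, weight)
) > 1e-9
local_changed = np.abs(box_filter_3x3(x_perturbed) - box_filter_3x3(x)) > 1e-9

print("pixels affected (spectral):", int(spectral_changed.sum()))
print("pixels affected (3x3 local):", int(local_changed.sum()))
```

The local filter only touches the 3x3 neighbourhood of the perturbed pixel, while the spectral operator spreads the change across the whole map in a single layer. This is the property the abstract refers to as a wider receptive field.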

Paper Structure (12 sections, 5 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Overview of our proposal. From a single RGB panorama (top left), we produce a semantic segmentation (top right) and estimate a depth map (bottom left) of an indoor environment. With this information we are able to reconstruct the whole environment in 3D (bottom right).
  • Figure 2: Architecture of our Frequential Depth estimation and Semantic segmentation Network (FreDSNet). The encoder consists of a feature extractor (ResNet) and four encoder blocks. The decoder consists of six decoder blocks and two branches that predict depth and semantic segmentation. The skip connections from the encoder to the decoder use learned weights.
  • Figure 3: a) FBC-N: encoder block composed of a Fourier Block (FB), a skip connection to the decoder, Down-Scaling (N) of scale N and a W-Conv (C). b) CFB-N: decoder block composed of a W-Conv (C), Up-Scaling (N) of scale N, the addition of a skip connection from the encoder and a Fourier Block (FB).
  • Figure 4: Qualitative comparison between HohoNet (Sun et al., 2021) and our proposal for semantic segmentation and depth estimation on Stanford2D3DS (Armeni et al., 2017).
  • Figure 5: First row: the RGB input to our network and its outputs, the semantic segmentation and the depth estimation. Second row: different useful representations of the environment that can be obtained from the output of FreDSNet. (For better visualization, the ceiling has been removed from all views.)
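The 3D reconstructions mentioned in Figures 1 and 5 rely on back-projecting the predicted depth panorama into space. The sketch below shows one common way to do this for an equirectangular image. It is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, depth is assumed to store the radial distance from the camera centre, and the axis convention (x right, y up, z forward) is chosen arbitrarily.

```python
import numpy as np

def equirect_depth_to_points(depth):
    """Back-project an H x W equirectangular depth map to an (H*W) x 3 point cloud.

    Assumes each depth value is the radial distance from the camera centre.
    """
    h, w = depth.shape
    # Map pixel centres to spherical angles.
    lon = ((np.arange(w) + 0.5) / w - 0.5) * 2.0 * np.pi   # longitude in (-pi, pi)
    lat = (0.5 - (np.arange(h) + 0.5) / h) * np.pi         # latitude in (-pi/2, pi/2), top row highest
    lon, lat = np.meshgrid(lon, lat)                       # both H x W
    # Unit ray directions (assumed convention: x right, y up, z forward).
    dirs = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )
    # Scale each ray by its depth and flatten to a list of points.
    return (dirs * depth[..., None]).reshape(-1, 3)

# Synthetic example: a constant depth of 2 m places every point on a sphere of radius 2.
depth = np.full((256, 512), 2.0)
points = equirect_depth_to_points(depth)
```

Per-pixel semantic labels can be flattened in the same row-major order and attached to the points, which yields a labelled point cloud of the kind the second row of Figure 5 depicts.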