DepthTCM: High Efficient Depth Compression via Physics-aware Transformer-CNN Mixed Architecture

Young-Seo Chang, Yatong An, Jae-Sang Hyun

Abstract

We propose DepthTCM, a physics-aware end-to-end framework for depth map compression. In DepthTCM, the high-bit depth map is first converted losslessly to a conventional 3-channel image representation using a method inspired by physical sinusoidal fringe-pattern profilometry systems; the resulting 3-channel color image is then encoded and decoded by a recently developed Transformer-CNN mixed neural network architecture. Specifically, DepthTCM maps depth to a smooth 3-channel representation using multiwavelength depth (MWD) encoding, globally quantizes the MWD-encoded representation to 4 bits per channel to reduce entropy, and finally compresses it using a learned codec that combines convolutional and Transformer layers. Experimental results demonstrate the advantage of the proposed method. On Middlebury 2014, DepthTCM reaches 0.307 bpp while preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG. We additionally demonstrate practical efficiency and scalability, reporting average end-to-end inference times of 41.48 ms (encoder) and 47.45 ms (decoder) on the ScanNet++ iPhone RGB-D subset. Ablations validate our design choices: relative to 8-bit quantization, 4-bit quantization reduces bitrate by 66% while maintaining comparable reconstruction quality, with only a marginal 0.68 dB PSNR change and a 0.04% accuracy difference. In addition, Transformer-CNN blocks further improve PSNR by up to 0.75 dB over CNN-only architectures.

Paper Structure (18 sections, 12 equations, 3 figures, 12 tables)

Figures (3)

  • Figure 1: DepthTCM framework architecture. The framework integrates multiwavelength depth (MWD) encoding, 4-bit quantization, and a hybrid Transformer-CNN compression backbone into a fully differentiable end-to-end system for efficient depth map compression and reconstruction.
  • Figure 2: Visualization of multiwavelength depth encoding. (a) Input 32-bit depth map. (b) 8-bit quantized MWD-encoded map. (c) 4-bit quantized MWD-encoded map. (d) Red channel: sinusoidal encoding, $\sin(2\pi Z/P)$. (e) Green channel: sinusoidal encoding, $\cos(2\pi Z/P)$. (f) Blue channel: normalized depth for long-wavelength variation.
  • Figure 3: Rate-distortion curves on the Middlebury 2014 dataset. NRMSE (%) is plotted against bitrate (bpp) measured per original depth pixel. The error is computed following the evaluation protocol of N-DEPTH, where metrics are calculated over the shared intersection of valid recovery regions. Rate-distortion operating points of our method are obtained by varying the Lagrangian multiplier $\lambda$.
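
The channel definitions in the Figure 2 caption can be illustrated with a minimal NumPy sketch of MWD encoding, uniform quantization, and decoding. It is an illustration under stated assumptions, not the authors' implementation: the function names, the offset-and-scale mapping of the sine and cosine channels into [0, 1], the measurement of depth relative to `z_min`, the uniform quantizer, and the normalization of NRMSE by the depth range are all assumed here, and the learned Transformer-CNN codec is omitted entirely.

```python
import numpy as np


def mwd_encode(z, z_min, z_max, period):
    """Map a depth map to three channels in [0, 1] (assumed scaling)."""
    zn = z - z_min                       # depth relative to z_min (assumption)
    phase = 2.0 * np.pi * zn / period
    r = 0.5 + 0.5 * np.sin(phase)        # red: sin(2*pi*Z/P), shifted to [0, 1]
    g = 0.5 + 0.5 * np.cos(phase)        # green: cos(2*pi*Z/P), shifted to [0, 1]
    b = zn / (z_max - z_min)             # blue: normalized depth
    return np.stack([r, g, b], axis=-1)


def quantize(img, bits):
    """Uniformly quantize values in [0, 1] to 2**bits levels (assumed scheme)."""
    levels = 2 ** bits - 1
    return np.round(np.clip(img, 0.0, 1.0) * levels) / levels


def mwd_decode(img, z_min, z_max, period):
    """Recover depth: fine phase from R/G, period index from the coarse B channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    phi = np.mod(np.arctan2(r - 0.5, g - 0.5), 2.0 * np.pi)  # wrapped phase
    frac = phi / (2.0 * np.pi)                               # position within one period
    coarse = b * (z_max - z_min)                             # coarse depth estimate
    k = np.round((coarse - period * frac) / period)          # integer period index
    return z_min + period * (k + frac)


def nrmse_percent(ref, rec):
    """RMSE normalized by the reference depth range, in percent (assumed definition)."""
    rmse = np.sqrt(np.mean((ref - rec) ** 2))
    return 100.0 * rmse / (ref.max() - ref.min())


if __name__ == "__main__":
    # Synthetic depth ramp; the range and period are arbitrary illustrative values.
    depth = np.linspace(0.5, 4.5, 1000).reshape(20, 50)
    encoded = mwd_encode(depth, 0.5, 4.5, period=1.0)
    for bits in (8, 4):
        recovered = mwd_decode(quantize(encoded, bits), 0.5, 4.5, period=1.0)
        print(bits, "bits, NRMSE (%):", nrmse_percent(depth, recovered))
```

In this sketch the blue channel only has to be accurate to within half a period to select the correct period index, while the red and green channels carry the fine depth information. This is one way to see why coarse quantization of the encoded image can still yield a small depth error.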