FocDepthFormer: Transformer with latent LSTM for Depth Estimation from Focal Stack

Xueyang Kang, Fengze Han, Abdur R. Fayjie, Patrick Vandewalle, Kourosh Khoshelham, Dong Gong

TL;DR

Depth estimation from focal stacks is addressed by FocDepthFormer, a Transformer–LSTM architecture that handles arbitrary stack lengths via latent-space fusion, aided by a multi-scale encoder and CNN decoder. The method leverages a Vision Transformer to capture non-local spatial cues and an LSTM to fuse information across different focus planes, enabling flexible input sizes. A combined regression and sharpness-aware training objective further improves depth boundaries, and optional monocular-depth pre-training can enhance representation. Across four focal-stack benchmarks, the approach delivers state-of-the-art results and demonstrates strong generalization with competitive runtime, highlighting its potential for robust depth estimation in defocus-rich imaging setups.

Abstract

Most existing methods for depth estimation from a focal stack of images employ convolutional neural networks (CNNs) using 2D or 3D convolutions over a fixed set of images. However, their effectiveness is constrained by the local properties of CNN kernels, and they can only process focal stacks with a fixed number of images during both training and inference. This limitation hampers their ability to generalize to stacks of arbitrary length. To overcome these limitations, we present a novel Transformer-based network, FocDepthFormer, which integrates a Transformer with an LSTM module and a CNN decoder. The Transformer's self-attention mechanism learns more informative spatial features by implicitly performing non-local cross-referencing. The LSTM module is designed to integrate representations across image stacks of varying lengths. Additionally, we employ multi-scale convolutional kernels in an early-stage encoder to capture low-level features at different degrees of focus/defocus. By incorporating the LSTM, FocDepthFormer can be pre-trained on large-scale monocular RGB depth estimation datasets, improving visual pattern learning and reducing reliance on difficult-to-obtain focal stack data. Extensive experiments on diverse focal stack benchmark datasets demonstrate that our model outperforms state-of-the-art approaches across multiple evaluation metrics.
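
To make the variable-length claim concrete, the following is a minimal NumPy sketch of recurrent fusion in latent space. It is an illustration only, not the authors' implementation: the class name LatentLSTMFusion, the token dimensions, and the random (untrained) weights are all assumptions. The same cell is applied to one focal slice after another, so the fused output has the same shape whether the stack holds 1, 3, or 10 images.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LatentLSTMFusion:
    """Toy LSTM cell that fuses per-image token features across a focal stack.

    Hypothetical stand-in for a latent-space LSTM; weights are random, not trained.
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # One matrix holds the input, forget, candidate, and output gate weights.
        self.W = rng.normal(0.0, 0.1, size=(2 * dim, 4 * dim))
        self.b = np.zeros(4 * dim)
        self.dim = dim

    def step(self, x, h, c):
        # Standard LSTM update applied independently to every token.
        z = np.concatenate([x, h], axis=-1) @ self.W + self.b
        i, f, g, o = np.split(z, 4, axis=-1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

    def fuse(self, stack_tokens):
        # stack_tokens has shape (num_images, num_tokens, dim).
        num_tokens = stack_tokens.shape[1]
        h = np.zeros((num_tokens, self.dim))
        c = np.zeros((num_tokens, self.dim))
        for tokens in stack_tokens:  # one focal slice at a time
            h, c = self.step(tokens, h, c)
        return h  # fused latent tokens, shape (num_tokens, dim)


fusion = LatentLSTMFusion(dim=8)
rng = np.random.default_rng(1)
single_out = fusion.fuse(rng.normal(size=(1, 16, 8)))   # monocular input
short_out = fusion.fuse(rng.normal(size=(3, 16, 8)))    # 3-image stack
long_out = fusion.fuse(rng.normal(size=(10, 16, 8)))    # 10-image stack
print(single_out.shape, short_out.shape, long_out.shape)
```

In this sketch a single image is simply a stack of length one, which is consistent with the abstract's point that the recurrent design permits pre-training on monocular depth data.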

Paper Structure (12 sections, 7 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: The overview of our proposed network, FocDepthFormer, is presented with its core components: the Transformer encoder, the recurrent LSTM module, and the CNN decoder. Preceding the Transformer encoder, early-stage multi-scale convolutional kernels are depicted within the dashed line. The resulting multi-scale feature maps are concatenated and subjected to spatial and depth-wise convolution. Subsequently, the fused feature map of an image stack is divided into patches, which are then individually projected by a linear embedding layer into tokens. A red token represents a global embedding token mapped from the entire image and is summed with each individual patch embedding token.
  • Figure 2: To illustrate the LSTM module in our network, the initial step involves grouping all cached output tokens from the Transformer encoder into activated and non-activated tokens. These two groups are then processed separately: activated tokens pass through LSTMs followed by max pooling, while non-activated tokens undergo average pooling. Following this, the output tokens are reshaped and concatenated before being fed into the CNN decoder to predict the depth map.
  • Figure 3: Comparison of Transformer attention between the two left column images. Cropped image patches within green and orange boxes in (a) and (c) serve as query inputs to compute the self-attention map over the entire input image, respectively. In (b) and (d), the attention maps on the left and right sides of the green line illustrate the attention outputs of the green and orange boxes, respectively. This demonstrates the model's capability to selectively attend to both foreground and background areas, distinguishing between focus and defocus cues.
  • Figure 4: Qualitative evaluation of our model on DDFF 12-Scene dataset.
  • Figure 5: Qualitative evaluation of our model on Mobile Depth dataset.
  • ...and 3 more figures
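
The tokenization step described in the Figure 1 caption (split the fused feature map into patches, project each patch linearly, and sum a global token onto every patch token) can be sketched as follows. This is a hedged illustration under stated assumptions: the function name tokenize, the patch size, the embedding width, and the use of random matrices in place of learned linear layers are hypothetical, and the caption does not say how the global token is computed, so a linear projection of the flattened map is assumed here.

```python
import numpy as np


def tokenize(feature_map, patch, embed_dim, seed=0):
    """Split an (H, W, C) feature map into patches and embed them as tokens."""
    H, W, C = feature_map.shape
    assert H % patch == 0 and W % patch == 0, "patch size must divide H and W"
    rng = np.random.default_rng(seed)
    # Assumption: random projections stand in for learned linear embeddings.
    W_patch = rng.normal(0.0, 0.02, size=(patch * patch * C, embed_dim))
    W_global = rng.normal(0.0, 0.02, size=(H * W * C, embed_dim))

    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    patches = feature_map.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

    patch_tokens = patches @ W_patch                    # (num_patches, embed_dim)
    global_token = feature_map.reshape(-1) @ W_global   # (embed_dim,)
    # The global token is summed with every patch token via broadcasting.
    return patch_tokens + global_token


fmap = np.ones((32, 32, 4))          # constant toy feature map
tokens = tokenize(fmap, patch=8, embed_dim=16)
print(tokens.shape)
```

With a 32×32 map and 8×8 patches this yields 16 tokens of width 16. Because the toy input is constant, every patch is identical and so are all the resulting tokens, which is a quick sanity check on the reshape logic.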