Enhancing Free-hand 3D Photoacoustic and Ultrasound Reconstruction using Deep Learning

SiYeoul Lee, SeonHo Kim, Minkyung Seo, SeongKyu Park, Salehin Imrus, Kambaluru Ashok, DongEon Lee, Chunsu Park, SeonYeong Lee, Jiye Kim, Jae-Heung Yoo, MinWoo Kim

TL;DR

This work tackles the challenge of sensorless freehand 3D PAUS reconstruction by introducing MoGLo-Net, a motion-based learning network that combines a global-local self-attention module with a correlation volume to robustly estimate six-degree-of-freedom motion from sequential B-mode frames. The method integrates a patch-wise correlation operation, dual RNN-based motion estimators, and a triad of specialized losses (MMAE, correlation loss, and margin triplet) to achieve accurate, drift-resistant 3D reconstructions, extended to Doppler and photoacoustic imaging for vascular visualization. Extensive experiments on in-house and open datasets demonstrate state-of-the-art performance, real-time inference, and clear ablation-driven insights into the contributions of global-local attention, correlation information, and motion-based supervision. The approach holds potential for clinically practical freehand PAUS imaging, enabling comprehensive 3D vascular visualization without external tracking hardware and with applicability across ultrasound, Doppler, and photoacoustic modalities.

Abstract

This study introduces a motion-based learning network with a global-local self-attention module (MoGLo-Net) to enhance 3D reconstruction in handheld photoacoustic and ultrasound (PAUS) imaging. Standard PAUS imaging is often limited by a narrow field of view and the inability to effectively visualize complex 3D structures. The 3D freehand technique, which aligns sequential 2D images for 3D reconstruction, faces significant challenges in accurate motion estimation without relying on external positional sensors. MoGLo-Net addresses these limitations through an adaptation of the self-attention mechanism that exploits critical regions within successive ultrasound images, such as fully developed speckle or highly echogenic tissue, to accurately estimate motion parameters. This facilitates the extraction of intricate features from individual frames. Additionally, we designed a patch-wise correlation operation to generate a correlation volume that is highly correlated with the scanning motion. A custom loss function was also developed to ensure robust learning with minimized bias, leveraging the characteristics of the motion parameters. Experimental evaluations demonstrated that MoGLo-Net surpasses current state-of-the-art methods in both quantitative and qualitative performance metrics. Furthermore, we expanded the application of 3D reconstruction technology beyond simple B-mode ultrasound volumes to incorporate Doppler ultrasound and photoacoustic imaging, enabling 3D visualization of vasculature. The source code for this study is publicly available at: https://github.com/guhong3648/US3D
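To make the role of motion estimation concrete, the following is a minimal sketch of how frame-to-frame six-degree-of-freedom estimates could be chained into absolute frame poses for volume assembly. It is not taken from the paper or its repository: the function names, the (tx, ty, tz, rx, ry, rz) parameter layout, and the Rz·Ry·Rx Euler convention are all assumptions chosen for illustration.

```python
import numpy as np

def euler_to_matrix(rx, ry, rz):
    """Rotation matrix from Euler angles in radians (assumed order: Rz @ Ry @ Rx)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rot_z @ rot_y @ rot_x

def relative_to_transform(delta):
    """4x4 rigid transform from a hypothetical 6-DoF vector (tx, ty, tz, rx, ry, rz)."""
    tx, ty, tz, rx, ry, rz = delta
    transform = np.eye(4)
    transform[:3, :3] = euler_to_matrix(rx, ry, rz)
    transform[:3, 3] = [tx, ty, tz]
    return transform

def accumulate_poses(deltas):
    """Chain relative transforms into absolute poses; the first frame is the identity."""
    poses = [np.eye(4)]
    for delta in deltas:
        # Each new pose is the previous pose composed with the next relative motion.
        poses.append(poses[-1] @ relative_to_transform(delta))
    return np.stack(poses)

# Usage: three identical elevational steps of 1 unit with no rotation.
poses = accumulate_poses([(0.0, 0.0, 1.0, 0.0, 0.0, 0.0)] * 3)
```

Because every pose is a product of all earlier relative transforms, a small error in any single estimate propagates to every later frame, which is why drift matters in sensorless reconstruction.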

Paper Structure

This paper contains 32 sections, 13 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of our motion-based learning network with a global-local self-attention module (MoGLo-Net) structure. Trainable components are denoted by color-filled arrows, signifying the neural networks. Rectangular or cubic shapes represent 2D images or 3D tensors, respectively. The model processes two B-mode sequences and outputs estimates of relative motion vectors $\hat{\Delta\boldsymbol{\theta}_i}$. Vectors or feature maps within dotted boxes contribute to the loss function, while the final estimates facilitate the assembly of 2D images into a 3D volume.
  • Figure 2: Correlation operation. It generates the correlation volume $\mathbf{C}_i$ from two feature maps $(\mathbf{E}_{i}^1, \mathbf{E}_{i+1}^1)$. The dotted box in each map represents the spatial region of interest (RoI), and the filled box represents a 3D patch spanning all channels but covering only part of the spatial RoI. The red patch remains fixed at the center of the RoI, while the black patch moves across the RoI. All possible correlations between the two patches are stored in a 2D array (red dotted box) within the volume. By moving both RoIs across the feature maps, these arrays are stacked to generate the full 3D correlation volume.
  • Figure 3: Global-Local Attention Module. The module recalibrates local features $\mathbf{E}^2$ and global features $\mathbf{E}^4$ to enhance motion estimation. Local feature blocks are extracted from $\mathbf{E}^2$ and recalibrated using channel attention, resulting in refined local feature blocks $\mathbf{R}_{k}$. Global features are derived from $\mathbf{E}^4$ through spatial and channel attention, yielding $\mathbf{G}$, which captures semantic information across the entire image. Each local feature block is weighted based on its similarity to the recalibrated global feature $\mathbf{G}$, and weighted local feature blocks $\mathbf{L_k}$ are projected to aggregate local information. The reshaped local feature $\mathbf{L}$ and global feature $\mathbf{G}$ serve as the final representations, which are input to the motion estimator.
  • Figure 4: Experimental setup for PAUS data acquisition and visualized results. (a) US machine connected to the laser system. (b) and (c) Transducer setup.
  • Figure 5: Two 3D reconstruction cases using US B-mode acquisitions. Each case is shown from two different viewing directions. The 3D ground-truth image is constructed by stacking 2D B-mode images using ground-truth positions. The unfilled 3D outlines in different colors are constructed using positions estimated by various deep learning models, so that their trajectories can be compared with the ground truth.
  • ...and 4 more figures
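
The correlation operation described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative reading of the caption, not the authors' implementation: the function name, the patch size, the displacement range, the RoI stride, and the use of cosine-normalized correlation are all assumptions. A patch centered in each RoI of the first feature map is compared against every displaced patch in the second map, giving one 2D array per RoI, and the arrays are stacked into a 3D volume.

```python
import numpy as np

def correlation_volume(feat_a, feat_b, patch=3, max_disp=2, stride=4, eps=1e-8):
    """Hypothetical patch-wise correlation between two (C, H, W) feature maps.

    Returns an array of shape (n_roi, D, D) with D = 2 * max_disp + 1.
    """
    assert feat_a.shape == feat_b.shape
    _, height, width = feat_a.shape
    r = patch // 2
    margin = r + max_disp              # keep every displaced patch inside the map
    d = 2 * max_disp + 1
    slices = []
    for y in range(margin, height - margin, stride):
        for x in range(margin, width - margin, stride):
            # Fixed reference patch at the RoI center, spanning all channels.
            ref = feat_a[:, y - r:y + r + 1, x - r:x + r + 1]
            corr = np.empty((d, d))
            for i, dy in enumerate(range(-max_disp, max_disp + 1)):
                for j, dx in enumerate(range(-max_disp, max_disp + 1)):
                    yy, xx = y + dy, x + dx
                    # Moving patch in the second map, displaced by (dy, dx).
                    mov = feat_b[:, yy - r:yy + r + 1, xx - r:xx + r + 1]
                    norm = np.linalg.norm(ref) * np.linalg.norm(mov) + eps
                    corr[i, j] = np.sum(ref * mov) / norm
            slices.append(corr)
    return np.stack(slices)

# Usage: the second map is the first one shifted by (+1, -1) pixels.
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((8, 16, 16))
feat_b = np.roll(feat_a, shift=(1, -1), axis=(1, 2))
volume = correlation_volume(feat_a, feat_b)
```

In this toy case each 2D slice peaks at the entry for displacement (dy, dx) = (+1, -1), which is how such a volume can carry in-plane motion cues. The explicit loops favor readability; a practical version would vectorize them.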