Cross-attention learning enables real-time nonuniform rotational distortion correction in OCT

Haoran Zhang; Jianlong Yang; Jingqian Zhang; Shiqing Zhao; Aili Zhang

TL;DR

This work addresses the real-time correction of nonuniform rotational distortion (NURD) in endoscopic OCT by introducing a stacked cross-attention network that models long-range dependencies between OCT A-lines. Framed as a self-supervised problem, the method predicts bi-directional A-line distortions and uses a cumulative transform to correct frames, guided by a combined loss with $\mathcal{L}_1$, $\mathcal{L}_{sm}$, and $\mathcal{L}_{si}$. Across synthetic and real datasets, it achieves superior correction accuracy and robustness while delivering real-time performance (~$26\pm3$ fps), outperforming feature-based, DP, and CNN-based baselines. The approach demonstrates strong potential for real-time OCT applications in surgical navigation and functional imaging, with discussions on data requirements and domain generalization.
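The cumulative-transform step mentioned above can be illustrated with a small sketch. It is not the authors' implementation. It assumes that each B-frame is stored in polar form as a `(depth, n_alines)` array, that per-A-line angular shifts between adjacent frames are already available (in the paper they would come from the network), and that adding these shifts frame by frame approximates the accumulated distortion. The names `accumulate_shifts` and `warp_frame` are hypothetical.

```python
import numpy as np

def accumulate_shifts(pairwise_shifts):
    """Sum frame-to-frame A-line shifts into shifts relative to frame 0.

    pairwise_shifts: (T-1, N) array; row t is the assumed angular shift
    (in A-line units) of frame t+1 relative to frame t.
    Returns a (T, N) array whose first row is all zeros.
    """
    pairwise_shifts = np.asarray(pairwise_shifts, dtype=float)
    zeros = np.zeros((1, pairwise_shifts.shape[1]))
    return np.concatenate([zeros, np.cumsum(pairwise_shifts, axis=0)], axis=0)

def warp_frame(frame, shift):
    """Resample a polar B-frame along the angular axis.

    Output column j is read from input position j + shift[j], wrapping
    around the full rotation, with linear interpolation between A-lines.
    """
    _, n = frame.shape
    src = (np.arange(n) + np.asarray(shift, dtype=float)) % n
    base = np.floor(src)
    lo = base.astype(int) % n   # left neighbour (guarded against src == n)
    hi = (lo + 1) % n           # right neighbour, wrapping circularly
    w = src - base              # interpolation weight in [0, 1)
    return (1.0 - w) * frame[:, lo] + w * frame[:, hi]

# Toy usage: a frame whose column values equal their A-line index.
frame = np.tile(np.arange(8, dtype=float), (3, 1))
cumulative = accumulate_shifts([[1, 1], [2, 0]])
corrected = warp_frame(frame, np.full(8, 2.0))
```

Simple addition of shifts is only an approximation of composing the per-frame warps; it is used here to keep the sketch short.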

Abstract

Nonuniform rotational distortion (NURD) correction is vital for endoscopic optical coherence tomography (OCT) imaging and its functional extensions, such as angiography and elastography. Current NURD correction methods require time-consuming feature tracking or cross-correlation calculations and thus sacrifice temporal resolution. Here we propose a cross-attention learning method for NURD correction in OCT. Our method is inspired by the recent success of the self-attention mechanism in natural language processing and computer vision. By leveraging its ability to model long-range dependencies, we can directly obtain the correlation between OCT A-lines at any distance, thus accelerating NURD correction. We develop an end-to-end stacked cross-attention network and design three types of optimization constraints. We compare our method with two traditional feature-based methods and a CNN-based method on two publicly available endoscopic OCT datasets and a private dataset collected on our home-built endoscopic OCT system. Our method achieves a $\sim3\times$ speedup, reaching real-time performance ($26\pm 3$ fps), together with superior correction performance.
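To make the idea of relating A-lines "at any distance" concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention between the A-lines of two frames. It is not the stacked network described in the paper: the projection matrices are identity placeholders rather than learned weights, the frames are random toy data, and the function name `cross_attention` is hypothetical. Each A-line (a column of a `(depth, n_alines)` polar frame) is treated as one token, so the attention matrix scores every A-line of one frame against every A-line of the other in a single matrix product.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(frame_q, frame_kv, w_q, w_k, w_v):
    """Single-head cross-attention with A-lines as tokens.

    frame_q, frame_kv: (depth, N) polar frames.
    w_q, w_k, w_v: (depth, d) projection matrices (learned in a real model).
    Returns (output, attn) with shapes (N, d) and (N, N).
    """
    tokens_q = frame_q.T                       # (N, depth): one row per A-line
    tokens_kv = frame_kv.T
    q = tokens_q @ w_q                         # queries from the first frame
    k = tokens_kv @ w_k                        # keys from the second frame
    v = tokens_kv @ w_v                        # values from the second frame
    scores = q @ k.T / np.sqrt(q.shape[-1])    # all-pairs A-line similarity
    attn = softmax(scores, axis=-1)            # each row sums to 1
    return attn @ v, attn

# Toy data: frame_b is frame_a rotated by 5 A-lines.
rng = np.random.default_rng(0)
depth, n_alines = 64, 32
frame_a = rng.random((depth, n_alines))
frame_a /= np.linalg.norm(frame_a, axis=0, keepdims=True)  # unit-norm A-lines
frame_b = np.roll(frame_a, 5, axis=1)

eye = np.eye(depth)  # placeholder for learned projections (assumption)
out, attn = cross_attention(frame_a, frame_b, eye, eye, eye)

# The strongest attention weight of each query A-line points at its match.
estimated_shift = (attn.argmax(axis=1) - np.arange(n_alines)) % n_alines
```

With unit-norm A-lines and identity projections, the peak of each attention row falls on the matching A-line regardless of how far it has rotated, which is the long-range property the abstract refers to. A trained network would replace the identity projections with learned ones and stack several such layers.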

Paper Structure (14 sections, 7 equations, 12 figures, 5 tables)

Introduction
Methods and materials
Overall framework
Stacked cross-attention network
Datasets and implementations
Results
Accuracy assessment of NURD correction
Robustness assessment of NURD correction
Correction performance on 3D stent imaging
Influence of training data
Ablation studies
Evaluation of processing speed
Discussion
Conclusions

Figures (12)

Figure 1: Overall framework of our proposed method. (a) and (b) illustrate its training and inference phases, respectively.
Figure 2: Illustration of the stacked cross-attention network. The upper panel shows the overall architecture and the lower dashed box shows the details of the multi-head cross-attention module.
Figure 3: (a) Photograph of our assembled proximal scanning micro-probe used for endoscopic OCT imaging. (b) Enlarged view of the black box in (a). (c) Photograph of the intravascular stent used in OCT imaging.
Figure 4: Quantitative comparison of different NURD correction methods using the two synthetic sequences. (a) is the result of the pig bronchus data. (b) is the result of the human nasopharynx data.
Figure 5: NURD correction performance on the two synthetic sequences. (a) is the result of the pig bronchus data. (b) is the result of the human nasopharynx data. (c) and (d) are the results after removing tissue-unrelated features, such as the sheath. The left column gives the original B-frames used to create the synthetic sequences. The middle column shows the axial maximum value projection of the synthetic sequences, which gives better views of the applied NURD. The right column gives the NURD-corrected synthetic sequences using our method.
...and 7 more figures
