BOTM: Echocardiography Segmentation via Bi-directional Optimal Token Matching

Zhihua Liu, Lei Tong, Xilin He, Che Liu, Rossella Arcucci, Chen Jin, Huiyu Zhou

TL;DR

This work tackles echocardiography segmentation under challenging noise and anatomical variation by enforcing cross-frame anatomical consistency through token-level optimal transport (OT). It introduces BOTM, which pairs a shared vision-transformer encoder with an OT-based token matching map $\mathbf{T}^{\star}$ and a Bi-directional Cross-Transport Attention proxy that uses forward and backward transport to refine token embeddings. Empirically, BOTM delivers state-of-the-art or competitive results on the CAMUS and TED datasets, with notable reductions in mean Hausdorff Distance and improvements in Dice, while remaining robust to artifacts and limited data. By avoiding heavy ad-hoc adapters and operating on patch-level tokens, BOTM provides better anatomical coherence and interpretability for temporal echocardiography segmentation.

Abstract

Existing echocardiography segmentation methods often suffer from anatomical inconsistency caused by shape variation, partial observation, and region ambiguity with similar intensity across 2D echocardiographic sequences, resulting in false-positive segmentations with anatomically defective structures under challenging low signal-to-noise ratio conditions. To provide a strong anatomical guarantee across different echocardiographic frames, we propose a novel segmentation framework named BOTM (Bi-directional Optimal Token Matching) that performs echocardiography segmentation and optimal anatomy transportation simultaneously. Given paired echocardiographic images, BOTM learns to match two sets of discrete image tokens by finding optimal correspondences from a novel anatomical transportation perspective. We further extend the token matching into a bi-directional cross-transport attention proxy to regulate the preserved anatomical consistency within the cardiac cyclic deformation in the temporal domain. Extensive experimental results show that BOTM generates stable and accurate segmentation outcomes (e.g., -1.917 HD on CAMUS2H LV, +1.9% Dice on TED) and provides a better matching interpretation with an anatomical consistency guarantee.
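
The matching map $\mathbf{T}^{\star}$ mentioned in the TL;DR can be read through the standard entropy-regularised optimal transport problem. The formulation below is only a sketch under assumptions that the abstract does not state: uniform marginals $\mathbf{a}$ and $\mathbf{b}$ over the $n$ and $m$ tokens of the two frames, a cosine cost $C_{ij}$ between token embeddings $\mathbf{x}_i$ and $\mathbf{y}_j$, and a regularisation weight $\varepsilon > 0$. The paper's exact objective may differ.

$$
\mathbf{T}^{\star} = \arg\min_{\mathbf{T} \in \Pi(\mathbf{a}, \mathbf{b})} \; \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} C_{ij} \; + \; \varepsilon \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \left( \log T_{ij} - 1 \right),
\qquad
C_{ij} = 1 - \frac{\mathbf{x}_i^{\top} \mathbf{y}_j}{\lVert \mathbf{x}_i \rVert \, \lVert \mathbf{y}_j \rVert},
$$

$$
\Pi(\mathbf{a}, \mathbf{b}) = \left\{ \mathbf{T} \in \mathbb{R}_{\ge 0}^{n \times m} \; : \; \mathbf{T} \mathbf{1}_m = \mathbf{a}, \;\; \mathbf{T}^{\top} \mathbf{1}_n = \mathbf{b} \right\}.
$$

Under this reading, each entry $T^{\star}_{ij}$ is the mass moved from token $i$ of one frame to token $j$ of the other, and the same plan can be read row-wise (forward) or column-wise (backward).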

Paper Structure

This paper contains 10 sections, 3 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Illustration of various echocardiography segmentation challenges and performance comparison: (a) Representative examples of echocardiography segmentation challenges, including shape variation across individuals, partial observations due to limited field-of-view, and visual ambiguity in regions with similar intensities. (b) Qualitative comparison of segmentation performance. Our BOTM achieves accurate segmentation, whereas other methods are affected by different types of noise, resulting in anatomically defective masks.
  • Figure 2: Key comparison of echocardiography segmentation methods, including traditional end-to-end methods [wu2022fat, he2023h2former] with (a) single-frame input and (b) image-pair input; (c) SAM with an echocardiography domain adapter, including [lin2024beyond, gowda2024cc]; (d) motion-tracking-based segmentation methods, including [kim2022diffusemorph, yang2024bidirectional]; and (e) our token-matching-based method, which regulates anatomical information transportation through a simple yet effective token matching pipeline for segmentation that is more semantically and anatomically coherent.
  • Figure 2: Quantitative results comparison on CAMUS4CH. Results are averaged across all regions and all frames.
  • Figure 3: Pipeline of the proposed BOTM: We first extract frame-dependent token embeddings using a vision transformer. At each embedding stage within the shared vision transformer encoder, BOTM learns optimal token matching correspondences via a bi-directional cross-transport attention mechanism. This serves as a proxy module to enforce implicit anatomical consistency across echocardiographic frames during the embedding process.
  • Figure 4: (a) Given a pair of image token embeddings, we first compute an optimal transport plan using Sinkhorn iterations, where each entry represents a matching probability derived from cosine similarity between embedding instances. (b) We then refine the paired embeddings via a cross-attention mechanism that incorporates the previously computed transport plan. A learnable anatomical importance mask is applied to suppress regions with high matching probability but low anatomical relevance, such as background areas.
  • ...and 6 more figures
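
The two steps described in the Figure 4 caption (a Sinkhorn transport plan built from cosine similarity, then a refinement of both token sets using that plan) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `cosine_cost`, `sinkhorn_plan`, and `cross_transport_refine`, the uniform marginals, the regularisation value `reg`, and the residual update are all assumptions, and the learnable anatomical importance mask and the learned attention projections are omitted.

```python
import numpy as np


def cosine_cost(x, y, eps=1e-8):
    """Pairwise cost 1 - cos(x_i, y_j) between token sets x (n, d) and y (m, d)."""
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    y = y / (np.linalg.norm(y, axis=1, keepdims=True) + eps)
    return 1.0 - x @ y.T


def sinkhorn_plan(cost, reg=0.5, n_iters=200):
    """Entropy-regularised transport plan with uniform marginals (assumed)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # mass on each token of the first frame
    b = np.full(m, 1.0 / m)      # mass on each token of the second frame
    kernel = np.exp(-cost / reg)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def cross_transport_refine(x, y, plan):
    """Refine both token sets with the same plan, read in both directions.

    Hypothetical simplification: the plan is normalised into attention-like
    weights and added residually, with no learned projections or mask.
    """
    forward = plan / plan.sum(axis=1, keepdims=True)       # (n, m): x attends to y
    backward = plan.T / plan.T.sum(axis=1, keepdims=True)  # (m, n): y attends to x
    x_refined = x + forward @ y
    y_refined = y + backward @ x
    return x_refined, y_refined
```

Calling `sinkhorn_plan(cosine_cost(x, y))` on two embedding matrices of shapes `(n, d)` and `(m, d)` gives an `(n, m)` plan whose entries sum to one. Normalising its rows or its columns yields the forward and backward matching weights used in `cross_transport_refine`. A smaller `reg` gives sharper, closer to one-to-one matches but needs more iterations to converge.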