MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer

Divyanshu Mishra, Pramit Saha, He Zhao, Netzahualcoyotl Hernandez-Cruz, Olga Patey, Aris Papageorghiou, J. Alison Noble

TL;DR

MCAT introduces a visual query-based video clip localization framework for fetal ultrasound, addressing the challenge of locating standard planes within dynamic video sweeps. By employing a Multi-Tier Class-Aware Transformer with tier-specific tokens and cross-attention-based fusion of video and visual query features, the method localizes the start and end frames of the target clip while using 96% fewer tokens. The dual-loss design, combining a Multi-Tier Dual Anchor Contrastive Loss with a Temporal Uncertainty-Aware Localization Loss, handles fine-grained class distinctions and annotation noise. On two ultrasound video datasets and an Ego4D-based dataset, MCAT improves mIoU over state-of-the-art methods by 10%, 13%, and 5.35%, respectively, suggesting practical benefits for prenatal care in resource-limited settings where rapid, reliable standard plane acquisition is vital.
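To make the idea of cross-attention-based fusion concrete, the following is a minimal sketch in plain Python. It is not MCAT's implementation: it assumes that video tokens act as attention queries and visual query tokens act as keys and values, it omits learned projections and multiple heads, and the function names `softmax` and `cross_attend` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(video_tokens: Sequence[Vector],
                 query_tokens: Sequence[Vector]) -> List[List[float]]:
    """Fuse each video token with the visual query tokens.

    Assumption: video tokens attend over the visual query tokens using
    scaled dot-product attention, and the attended context is added back
    to the video token as a residual. No learned weights are used.
    """
    dim = len(query_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for v in video_tokens:
        # Attention weights of this video token over all query tokens.
        scores = [scale * sum(a * b for a, b in zip(v, q)) for q in query_tokens]
        weights = softmax(scores)
        # Weighted sum of the query tokens (the "context" vector).
        context = [sum(w * q[d] for w, q in zip(weights, query_tokens))
                   for d in range(dim)]
        # Residual connection keeps the original video information.
        fused.append([a + c for a, c in zip(v, context)])
    return fused
```

In a real model the tokens would first pass through learned query, key, and value projections; the sketch only shows how query information can be injected into every video token.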

Abstract

Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is time-consuming and prone to intra- and inter-sonographer variability. Existing methods primarily rely on image-based approaches that capture standard frames and then classify the input frames across different anatomies. This ignores the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce the Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method, to assist sonographers by enabling them to capture a quick US sweep. By then providing a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural-image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, using 96% fewer tokens. MCAT's efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying US-based screening and diagnosis, and allowing sonographers to examine more patients.
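Since the task is to predict the start and end frames of a clip, the mIoU figures quoted above are presumably temporal overlap scores between predicted and annotated clips. The sketch below shows one way such a score could be computed. It assumes inclusive frame indices and a plain average over videos; the paper's exact evaluation protocol is not stated here, and the names `temporal_iou` and `mean_iou` are hypothetical.

```python
from typing import Sequence, Tuple

Clip = Tuple[int, int]  # (start_frame, end_frame), both inclusive


def temporal_iou(pred: Clip, gt: Clip) -> float:
    """Intersection over union of two inclusive frame intervals."""
    (p_start, p_end), (g_start, g_end) = pred, gt
    if p_start > p_end or g_start > g_end:
        raise ValueError("start frame must not exceed end frame")
    # Number of frames shared by both clips (0 if they do not overlap).
    inter = max(0, min(p_end, g_end) - max(p_start, g_start) + 1)
    # Frames covered by either clip.
    union = (p_end - p_start + 1) + (g_end - g_start + 1) - inter
    return inter / union


def mean_iou(preds: Sequence[Clip], gts: Sequence[Clip]) -> float:
    """Average temporal IoU over paired predictions and annotations."""
    if not preds or len(preds) != len(gts):
        raise ValueError("need equally sized, non-empty clip lists")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For example, a predicted clip spanning frames 10 to 19 against an annotated clip spanning frames 15 to 24 shares 5 frames out of 15 covered in total, giving an IoU of 1/3.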

Paper Structure

This paper contains 25 sections, 10 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: (a) Self-similarity matrices for a randomly chosen video from Ego4D (Grauman et al., 2022; top, mean = 0.3) and from our clinical video dataset (bottom, mean = 0.75); the higher mean similarity reveals the greater difficulty of our video clip localization task. The uncertainty in the annotations of two expert cardiologists is shown by the green and blue boxes. (b) Cosine similarity of the visual query with the video for both Ego4D (top) and our data (bottom). Our clinical data yields similar scores along the whole video, which underlines the challenge, whereas Ego4D exhibits high scores only within the region of interest.
  • Figure 2: Main architecture of MCAT. The input video $v$ and visual query $q$ are passed to the visual backbone to produce multi-tier features. These features are fused spatially using the Multi-Tier Query Aware Spatial Transformer. The Tier-specific features are then passed to (a) ${\mathcal{L}_{MTDA}}$, to learn the separation between classes, and (b) the Multi-Tier Temporal Fusion Transformer, to learn the Tier-Aware Spatio-Temporal Embedding, which is in turn passed to an MLP to make the final prediction and compute the $\mathcal{L}_{URL}$ loss.
  • Figure 3: (Left) The spatial feature fusion mechanism, in which Tier-specific video and visual query (VQ) features are spatially fused to give Tier-specific query-aware features. (Right) The Tier-specific query-aware features are first resized, flattened, and enriched with positional information; the resulting features are then concatenated and fused to learn the Tier-Aware Spatio-Temporal Embedding.
  • Figure 4: Comparison of the predictions of a single-tier model and our multi-tier model for an LVOT visual query.
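The two diagnostics in Figure 1 can be reproduced for any set of per-frame feature vectors. The sketch below is a plain-Python illustration under stated assumptions: frames and the visual query are already encoded as fixed-length vectors by some backbone (not specified here), the reported mean is taken over off-diagonal entries (the caption does not say whether the diagonal is included), and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def self_similarity(frames: Sequence[Vector]) -> List[List[float]]:
    """Frame-by-frame cosine similarity matrix, as in Figure 1(a)."""
    return [[cosine(a, b) for b in frames] for a in frames]


def mean_off_diagonal(matrix: Sequence[Sequence[float]]) -> float:
    """Mean similarity between distinct frames (diagonal excluded)."""
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two frames")
    total = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def query_similarity(query: Vector, frames: Sequence[Vector]) -> List[float]:
    """Cosine similarity of the visual query with every frame, as in Figure 1(b)."""
    return [cosine(query, f) for f in frames]
```

A video whose mean off-diagonal similarity is high, and whose query similarity curve is nearly flat, offers few cues for separating the target clip from the rest of the sweep, which is the difficulty the caption attributes to the clinical data.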