Med-Query: Steerable Parsing of 9-DoF Medical Anatomies with Query Embedding

Heng Guo; Jianfeng Zhang; Ke Yan; Le Lu; Minfeng Xu

TL;DR

Med-Query introduces a steerable 9-DoF anatomy parsing framework for CT scans that combines an ROI extractor, a Transformer-based one-stage detector, and a stand-alone segmentation head. A weighted adjacency matrix enforces a fixed binding between queries and anatomy classes, and a 9-DoF box parameterization describes each instance, so the framework can detect and label ribs, vertebrae, and abdominal organs accurately while keeping inference fast. The approach yields state-of-the-art rib parsing performance on RibInst and competitive results on spine and multi-organ segmentation, and the authors release the new RibInst dataset to support future research. This work advances 3D medical image analysis by enabling targeted, efficient, query-driven parsing of anatomical structures.
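To make the 9-DoF box idea concrete, here is a minimal sketch that turns nine parameters into the eight corners of an oriented 3D box. It assumes the nine degrees of freedom split into a center, three side lengths, and three rotation angles, and it assumes a Z-Y-X Euler composition; the class name `Box9DoF`, the function names, and the angle convention are illustrative choices, not taken from the paper or its code.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass
class Box9DoF:
    """Hypothetical container: 3 position + 3 size + 3 rotation parameters."""
    center: tuple  # (x, y, z)
    size: tuple    # side lengths along the box's own axes
    angles: tuple  # (alpha, beta, gamma) in radians, about x, y, z


def rotation_matrix(alpha, beta, gamma):
    """R = Rz(gamma) * Ry(beta) * Rx(alpha). The ordering is an assumption."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]


def box_corners(box):
    """Return the 8 corners of the oriented box in world coordinates."""
    rot = rotation_matrix(*box.angles)
    corners = []
    for signs in product((-0.5, 0.5), repeat=3):
        # Corner offset in the box's local frame.
        local = [s * length for s, length in zip(signs, box.size)]
        # Rotate the offset, then translate by the box center.
        world = tuple(
            box.center[i] + sum(rot[i][j] * local[j] for j in range(3))
            for i in range(3)
        )
        corners.append(world)
    return corners


axis_aligned = Box9DoF(center=(10.0, 20.0, 30.0), size=(4.0, 6.0, 8.0),
                       angles=(0.0, 0.0, 0.0))
corners = box_corners(axis_aligned)
```

With all three angles set to zero the box is axis-aligned, which is the special case a 6-DoF box can already describe; the extra three angles let the box follow obliquely oriented structures such as ribs.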

Abstract

Automatic parsing of human anatomies at the instance level from 3D computed tomography (CT) is a prerequisite step for many clinical applications. The presence of pathologies, broken structures, or a limited field-of-view (FOV) can all make anatomy parsing algorithms vulnerable. In this work, we explore how to leverage and implement the successful detection-then-segmentation paradigm for 3D medical data, and propose a steerable, robust, and efficient computing framework for detection, identification, and segmentation of anatomies in CT scans. Considering the complicated shapes, sizes, and orientations of anatomies, without loss of generality, we present a nine degrees of freedom (9-DoF) pose estimation solution in full 3D space using a novel single-stage, non-hierarchical representation. Our whole framework is executed in a steerable manner where any anatomy of interest can be directly retrieved to further boost inference efficiency. We have validated our method on three medical image parsing tasks: ribs, spine, and abdominal organs. For rib parsing, CT scans have been annotated at the rib instance level for quantitative evaluation, similarly for spine vertebrae and abdominal organs. Extensive experiments on 9-DoF box detection and rib instance segmentation demonstrate the high efficiency and effectiveness of our framework (with an identification rate of 97.0% and a segmentation Dice score of 90.9%), comparing favorably against several strong baselines (e.g., CenterNet, FCOS, and nnU-Net). For spine parsing and abdominal multi-organ segmentation, our method achieves competitive results on par with state-of-the-art methods on the public CTSpine1K dataset and FLARE22 competition, respectively. Our annotations, code, and models are available at: https://github.com/alibaba-damo-academy/Med_Query.
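The steerable retrieval described in the abstract can be pictured as selecting a subset of learned query embeddings before decoding. The sketch below is only an illustration under stated assumptions: the rib naming scheme (`L1`..`L12`, `R1`..`R12`), the lookup table, and the `select_queries` helper are hypothetical, and the placeholder embeddings stand in for learned vectors.

```python
# Hypothetical label set: 24 ribs named L1..L12 and R1..R12 (naming is an assumption).
ANATOMY_NAMES = [f"{side}{i}" for side in ("L", "R") for i in range(1, 13)]
NAME_TO_INDEX = {name: idx for idx, name in enumerate(ANATOMY_NAMES)}


def select_queries(query_embeddings, wanted):
    """Return the indices and embeddings of the requested anatomies only.

    Assumes query i is permanently bound to anatomy class i, so picking
    rows of the embedding table is enough to steer the detector.
    """
    indices = [NAME_TO_INDEX[name] for name in wanted]
    return indices, [query_embeddings[i] for i in indices]


# Placeholder embeddings: one 4-dimensional vector per anatomy class.
embeddings = [[float(i)] * 4 for i in range(len(ANATOMY_NAMES))]
indices, selected = select_queries(embeddings, ["L1", "L3", "R1"])
```

Because only the selected queries would be passed on, the work done per scan scales with the number of requested anatomies rather than with the full label set, which is one plausible reading of the efficiency gain the abstract mentions.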

Paper Structure (18 sections, 3 equations, 7 figures, 5 tables)

Introduction
Related Work
Approach
Problem Definition
9-DoF Box Parameterization
Med-Query Architecture
Data Augmentation
Experiments
Datasets
RibInst Curation Details
Training Details
Performance Metrics
Main Results
Ablation Study
Discussion
...and 3 more sections

Figures (7)

Figure 1: Illustration of the steerable anatomy parsing concept. (a) Input 3D CT scan. (b) Target rib identified by query. (c) Target vertebra identified by query. (d) Identifying pancreas and spleen simultaneously. The entity in the query text will be mapped to its corresponding learned query embedding. Best viewed in color.
Figure 2: An instantiation of the Med-Query architecture for rib parsing, which consists of A: a ribcage ROI extractor, B: a steerable 9-DoF parametric rib detector, and C: a stand-alone segmentation head, for robust and efficient rib parsing, i.e., instance segmentation and labeling. In (B) Anatomy Detector, we use an adapted 3D version of ResNet (He et al., 2016) as the feature extractor. The stacked colored illustrative blocks next to the feature extractor represent the flattened spatial features. A set of $q_i$ constitutes the queries, and a set of $x_i$ constitutes the targets.
Figure 3: The mappings between queries and ground-truth boxes in a transformer-based detector may be random (a)(b). We intend to obtain a fixed binding outcome (c) via a weighted adjacency matrix during training. (d) shows an instantiation of the weighted adjacency matrix with 10 queries (represented by rows) and 10 ground-truth classes (represented by columns).
Figure 4: Detection visualizations show that our 9-DoF predictions enclose the ground-truth rib masks accurately. (a) Normal results in superior-to-inferior view. (b) A limited FOV case in posterior-to-anterior view. (c) A case with rib adhesions. (d) Only odd-number labels are queried. Ground-truth masks are rendered as visual reference.
Figure 5: An example with broken structures in RibInst. Missing or wrong labels are marked using golden arrows and dashed circles, respectively.
...and 2 more figures
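The fixed query binding described in the Figure 3 caption can be sketched as a weighted assignment problem. In this minimal example, a raw matching cost between each query and each ground-truth class is multiplied element-wise by a weight matrix that is small on the diagonal and large elsewhere, so the optimal one-to-one assignment is pushed toward "query i matches class i". The weight values, the element-wise product, and the brute-force solver are all assumptions made for illustration; the caption does not specify how the matrix enters the matching cost.

```python
from itertools import permutations


def weighted_adjacency(n, on_diag=1.0, off_diag=10.0):
    """Hypothetical weights: cheap on the diagonal, expensive elsewhere."""
    return [[on_diag if i == j else off_diag for j in range(n)]
            for i in range(n)]


def best_assignment(cost):
    """Brute-force one-to-one assignment; only suitable for tiny examples."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)


def bind_queries(raw_cost, weights):
    """Scale each raw cost by its weight, then solve the assignment."""
    weighted = [[c * w for c, w in zip(cost_row, weight_row)]
                for cost_row, weight_row in zip(raw_cost, weights)]
    return best_assignment(weighted)


# Rows are queries, columns are ground-truth classes (toy values).
raw_cost = [
    [0.9, 0.2, 0.8],
    [0.3, 0.7, 0.9],
    [0.8, 0.9, 0.4],
]
free_matching = best_assignment(raw_cost)                      # [1, 0, 2]
fixed_matching = bind_queries(raw_cost, weighted_adjacency(3))  # [0, 1, 2]
```

Without weights, queries 0 and 1 swap classes because their raw costs happen to favor it; with the weights applied, the identity assignment wins. A practical implementation would replace the brute-force search with a polynomial-time solver such as the Hungarian algorithm.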
