Align Your Query: Representation Alignment for Multimodality Medical Object Detection

Ara Seo, Bryan Sangwoo Kim, Hyungjin Chung, Jong Chul Ye

TL;DR

This work tackles the challenge of multimodality medical object detection by aligning DETR-style object queries with modality context. It introduces modality tokens derived from text encoders, a lightweight Multimodality Context Attention (MoCA) mechanism to fuse these tokens into the query set, and a pretraining stage called QueryREPA that uses a contrastive loss with modality-balanced batches to align queries to modality tokens. The combination yields modality-aware, class-faithful queries that transfer effectively across diverse modalities with minimal architectural changes and negligible latency. Experiments on a large, mixed-modality dataset show state-of-the-art AP improvements over strong DETR-based baselines across multiple encoders, underscoring the practical value for robust multimodality detection in clinical settings.

Abstract

Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings encoding imaging modality that are lightweight and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), mixing object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Across diverse modalities trained together, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodality medical object detection. Project page: https://araseo.github.io/alignyourquery/.
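
The contrastive alignment idea behind QueryREPA can be illustrated with a generic InfoNCE loss over a modality-balanced batch. The following is a minimal sketch under stated assumptions, not the authors' implementation: the mean pooling of queries, the temperature value, the toy three-dimensional embeddings, and the function names `pool_queries` and `info_nce` are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pool_queries(queries):
    """Mean-pool a list of query vectors into one vector (assumed pooling)."""
    n = len(queries)
    return [sum(q[c] for q in queries) / n for c in range(len(queries[0]))]


def info_nce(query_embs, token_embs, temperature=0.1):
    """Mean InfoNCE loss where query_embs[i] is the positive for token_embs[i].

    The other modality tokens in the batch act as in-batch negatives, which is
    why the batch is assumed to hold one image per modality.
    """
    q = [l2_normalize(v) for v in query_embs]
    t = [l2_normalize(v) for v in token_embs]
    losses = []
    for i, qi in enumerate(q):
        logits = [dot(qi, tk) / temperature for tk in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy modality tokens for a balanced batch (one image each of CXR, CT, MRI).
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Pooled queries that already point toward their own modality token.
aligned = [pool_queries([tok, tok]) for tok in tokens]
# Pooled queries that point toward the wrong modality token.
misaligned = [tokens[1], tokens[2], tokens[0]]

aligned_loss = info_nce(aligned, tokens)
misaligned_loss = info_nce(misaligned, tokens)
```

In this toy setup the loss is close to zero when each pooled query matches its own modality token and large when it matches another modality's token, which is the behavior a contrastive alignment objective is meant to reward.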

Paper Structure

This paper contains 44 sections, 2 theorems, 19 equations, 10 figures, 5 tables.

Key Result

Proposition 1

Given the InfoNCE objective of Eq. (eq:infonce) on positive pairs $(U^{(l)}, V)$ and $K$ in-batch negatives $\{ V_k \}$, the standard lower bound is expressed as

Figures (10)

  • Figure 1: (a) Contrastive Representation Alignment of Object Queries (QueryREPA): During a pretraining stage, a text-derived modality token is selected for each image in a modality-balanced batch (e.g., a CXR and an MRI). Object query representations are aligned with the modality token through a contrastive alignment loss to produce queries aware of modality context. (b) Multimodality Context Attention (MoCA): For a given image, a modality token is selected and concatenated with the set of object queries to form an augmented query set. Information fusion occurs in the self-attention layer of the decoder, allowing each object query to attend to the modality token to explicitly attain modality-specific context.
  • Figure 2: Comparison of text/modality integration in DETR-style detectors. (a) DINO (baseline): image-only encoder–decoder with query selection. (b) Grounding DINO: language-guided query selection and cross-attention in the decoder (object queries attend to a sequence of text tokens). (c) Ours (DINO+MoCA): append compact modality tokens to the object queries and fuse by self-attention in the decoder; the tokens act as semantic anchors that refine queries in a modality-aware way with minimal overhead.
  • Figure 3: Qualitative Comparison. Comparison results between various state-of-the-art detection methods and the proposed method are shown above. Our method effectively leverages modality context to significantly enhance anomaly localization (highlighted in red), compared to baseline results (highlighted in blue). Ground truth bounding boxes are highlighted in green. For cases where the bounding boxes are small, insets show an enlarged view of the highlighted yellow region.
  • Figure 4: UMAP embedding of modality tokens for varying ${\mathcal{E}}$.
  • Figure 5: List of 27 categories used in our experiments.
  • ...and 5 more figures
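
The fusion step described in the captions of Figure 1(b) and Figure 2(c), appending a modality token to the object queries and mixing them by self-attention, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's implementation: it uses a single attention head with identity query/key/value projections, toy four-dimensional vectors, and it drops the modality token from the output. The names `self_attention` and `moca_layer` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(weights[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return out


def moca_layer(object_queries, modality_token):
    """Append the modality token, mix by self-attention, return only the queries."""
    augmented = object_queries + [modality_token]  # augmented query set
    mixed = self_attention(augmented)
    # Assumption: only the refined object queries are passed on.
    return mixed[: len(object_queries)]


# Two toy object queries and two toy modality tokens.
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
cxr_token = [0.0, 0.0, 1.0, 0.0]
mri_token = [0.0, 0.0, 0.0, 1.0]

out_cxr = moca_layer(queries, cxr_token)
out_mri = moca_layer(queries, mri_token)
```

The same object queries end up with different representations depending on which modality token is appended, and the number of queries is unchanged, so the rest of a DETR-style decoder would not need to be modified in this sketch.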

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 1
  • proof