Align Your Query: Representation Alignment for Multimodality Medical Object Detection
Ara Seo, Bryan Sangwoo Kim, Hyungjin Chung, Jong Chul Ye
TL;DR
This work tackles the challenge of multimodality medical object detection by aligning DETR-style object queries with modality context. It introduces modality tokens derived from text encoders, a lightweight Multimodality Context Attention (MoCA) mechanism to fuse these tokens into the query set, and a pretraining stage called QueryREPA that uses a contrastive loss with modality-balanced batches to align queries to modality tokens. The combination yields modality-aware, class-faithful queries that transfer effectively across diverse modalities with minimal architectural changes and negligible added latency. Experiments on a large, mixed-modality dataset show state-of-the-art AP improvements over strong DETR-based baselines across multiple encoders, underscoring the practical value for robust multimodality detection in clinical settings.
Abstract
Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings encoding imaging modality that are lightweight and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), mixing object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Across diverse modalities trained jointly, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodality medical object detection. Project page: https://araseo.github.io/alignyourquery/.
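
To make the two ideas above concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes that MoCA can be approximated by appending one modality token to the query set and running a single self-attention step with identity projections and a residual connection, and that the QueryREPA objective can be approximated by an InfoNCE-style loss between a mean-pooled query vector and the set of modality tokens. The function names (moca_sketch, query_alignment_loss), the pooling choice, the temperature value, and the toy vectors are all hypothetical; the abstract does not specify these details, and modality-balanced batching is not shown.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(v):
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]


def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def moca_sketch(queries, modality_token):
    """Self-attention over [queries; modality_token], returning updated queries.

    Assumptions (not from the paper): single head, identity Q/K/V
    projections, residual connection, modality token dropped from the output.
    """
    tokens = queries + [modality_token]
    d = len(modality_token)
    scale = 1.0 / math.sqrt(d)
    updated = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in tokens])
        mixed = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated


def query_alignment_loss(pooled_queries, modality_tokens, labels, temperature=0.1):
    """InfoNCE-style loss: each pooled query vector should be closest to the
    token of its own modality (index given by labels). Hypothetical stand-in
    for the task-specific contrastive objective mentioned in the abstract."""
    tokens = [normalize(t) for t in modality_tokens]
    losses = []
    for z, y in zip(pooled_queries, labels):
        z = normalize(z)
        probs = softmax([dot(z, t) / temperature for t in tokens])
        losses.append(-math.log(probs[y]))
    return sum(losses) / len(losses)


# Toy example: three made-up modality tokens (standing in for CXR, CT, MRI)
# and two object queries for an image whose modality index is 1.
modality_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
queries = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]

updated = moca_sketch(queries, modality_tokens[1])
loss = query_alignment_loss([mean_pool(updated)], modality_tokens, labels=[1])
print(len(updated), len(updated[0]), loss > 0.0)
```

In this sketch the number and dimension of the queries are unchanged, which mirrors the claim that the detector architecture is preserved; in a real DETR-style model the attention step would use learned projections and multiple heads.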
