Tables Guide Vision: Learning to See the Heart through Tabular Data
Marta Hasny, Maxime Di Folco, Keno Bressem, Julia Schnabel
TL;DR
TGV tackles false negatives in contrastive learning for medical imaging by using clinically relevant tabular attributes to guide intra-batch pairing, rather than relying solely on image augmentations or joint image-tabular embeddings. It defines a tabular similarity matrix $S = \lambda S_{con} + (1-\lambda) S_{cat}$ to select semantically related image pairs and trains with a multi-positive, thresholded loss $L$ to learn robust visual representations. Because the resulting representations are unimodal, zero-shot prediction is handled by a k-NN predictor built on a representative set, so no joint embedding is required. Experiments on UK Biobank cardiac MR data show better CAD classification and cardiac phenotype prediction than augmentation-only and joint image-tabular baselines across fine-tuning, linear probing, and zero-shot settings. The method also generalizes to natural images, as shown on a car advertisement dataset. Overall, the work demonstrates that rich tabular data can supervise visual representation learning and extend zero-shot capabilities to unimodal models in medical contexts.
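As an illustration of the similarity matrix $S = \lambda S_{con} + (1-\lambda) S_{cat}$, the following is a minimal sketch and not the authors' implementation. It reads $S_{con}$ and $S_{cat}$ as similarities over continuous and categorical attributes. The specific choices here are assumptions: cosine similarity on z-scored continuous features for $S_{con}$, the fraction of matching categorical attributes for $S_{cat}$, and the hypothetical names `tabular_similarity`, `positive_mask`, `lam`, and `threshold`.

```python
import numpy as np

def tabular_similarity(x_con, x_cat, lam=0.5):
    """Blend two similarity matrices: S = lam * S_con + (1 - lam) * S_cat."""
    # Assumption: S_con is cosine similarity of z-scored continuous
    # features, rescaled from [-1, 1] to [0, 1].
    z = (x_con - x_con.mean(axis=0)) / (x_con.std(axis=0) + 1e-8)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    s_con = (z @ z.T + 1.0) / 2.0
    # Assumption: S_cat is the fraction of categorical attributes
    # on which two samples agree.
    s_cat = (x_cat[:, None, :] == x_cat[None, :, :]).mean(axis=2)
    return lam * s_con + (1.0 - lam) * s_cat

def positive_mask(S, threshold=0.8):
    """Mark intra-batch pairs with similarity >= threshold as positives."""
    mask = S >= threshold
    np.fill_diagonal(mask, False)  # a sample is not paired with itself
    return mask

# Toy batch: samples 0 and 1 share all attributes, sample 2 differs.
x_con = np.array([[1.0, 2.0], [1.0, 2.0], [5.0, 0.0]])
x_cat = np.array([[0, 1], [0, 1], [1, 0]])
S = tabular_similarity(x_con, x_cat, lam=0.5)
mask = positive_mask(S, threshold=0.8)
```

In this toy batch only the pair (0, 1) is marked positive, so a multi-positive contrastive loss would pull those two images together instead of treating them as negatives.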
Abstract
Contrastive learning methods in computer vision typically rely on augmented views of the same image or on multimodal pretraining strategies that align paired modalities. However, these approaches often overlook semantic relationships between distinct instances, leading to false negatives when semantically similar samples are treated as negatives. This limitation is especially consequential in medical imaging domains such as cardiology, where demographic and clinical attributes play a key role in assessing disease risk and patient outcomes. We introduce a tabular-guided contrastive learning framework that leverages clinically relevant tabular data to identify patient-level similarities and construct more meaningful pairs, enabling semantically aligned representation learning without requiring joint embeddings across modalities. Additionally, because unimodal representations lack zero-shot capability, we adapt the k-NN algorithm for zero-shot prediction. We demonstrate the strength of our methods on a large cohort of short-axis cardiac MR images and clinical attributes, where tabular data helps to distinguish between patient subgroups more effectively. Evaluation on downstream tasks, including fine-tuning, linear probing, and zero-shot prediction of coronary artery disease (CAD) and cardiac phenotypes, shows that tabular data guidance yields stronger visual representations than conventional methods that rely solely on image augmentation or on combined image-tabular embeddings. Further, we show that our method generalizes to natural images by evaluating it on a car advertisement dataset. The code will be available on GitHub upon acceptance.
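The k-NN zero-shot idea can be sketched as follows. This is a hedged illustration only: the abstract does not specify how the k-NN algorithm is adapted, so the cosine-similarity metric, the unweighted averaging of neighbor labels, and the names `knn_zero_shot`, `ref_emb`, and `ref_labels` are all assumptions.

```python
import numpy as np

def knn_zero_shot(query_emb, ref_emb, ref_labels, k=5):
    """Predict a label for each query image from its k nearest
    reference embeddings, without training a classifier head."""
    # Assumption: neighbors are ranked by cosine similarity.
    q = query_emb / (np.linalg.norm(query_emb, axis=1, keepdims=True) + 1e-8)
    r = ref_emb / (np.linalg.norm(ref_emb, axis=1, keepdims=True) + 1e-8)
    sims = q @ r.T
    # Indices of the k most similar reference samples per query.
    idx = np.argsort(-sims, axis=1)[:, :k]
    # Assumption: unweighted mean of neighbor labels. For binary labels
    # this is the fraction of positive neighbors; for continuous
    # phenotypes it is a k-NN regression estimate.
    return ref_labels[idx].mean(axis=1)

# Toy representative set: two embeddings per class.
ref_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
ref_labels = np.array([1.0, 1.0, 0.0, 0.0])
query_emb = np.array([[1.0, 0.05], [0.0, 1.0]])
scores = knn_zero_shot(query_emb, ref_emb, ref_labels, k=2)
```

Here the first query scores 1.0 and the second 0.0, since each one's two nearest reference embeddings share a single label. Only image embeddings are needed at inference time, which is what allows a unimodal encoder to make predictions without a joint image-tabular embedding space.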
