Tables Guide Vision: Learning to See the Heart through Tabular Data

Marta Hasny, Maxime Di Folco, Keno Bressem, Julia Schnabel

TL;DR

TGV tackles false negatives in contrastive learning for medical imaging by using clinically relevant tabular attributes to guide intra-batch pairing, rather than relying solely on image augmentations or joint image-tabular embeddings. It defines a tabular similarity matrix $S = \lambda S_{con} + (1-\lambda) S_{cat}$ to select semantically related image pairs and employs a multi-positive, thresholded loss $L$ to learn robust visual representations. A unimodal zero-shot predictor is built on a representative set via a k-NN approach, enabling prediction without joint embeddings. Experiments on UK Biobank cardiac MR data show superior CAD classification and cardiac phenotype prediction across fine-tuning, linear probing, and zero-shot settings, and the method generalizes to natural images such as cars. Overall, the work demonstrates that rich tabular data can meaningfully supervise visual representation learning and extend zero-shot capabilities in medical contexts, with potential broad applicability.
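To make the pairing step concrete, here is a minimal sketch of how a blended similarity $S = \lambda S_{con} + (1-\lambda) S_{cat}$ could select positives within a batch and feed a multi-positive loss. Everything beyond the blending formula is an assumption made for illustration, not the authors' implementation: cosine similarity for $S_{con}$, fraction of matching attributes for $S_{cat}$, a fixed threshold on $S$, a SupCon-style loss form, and all function names and toy values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def categorical_match(a, b):
    """Fraction of categorical attributes on which two rows agree (assumed S_cat)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def tabular_similarity(cont, cat, lam=0.5):
    """S = lam * S_con + (1 - lam) * S_cat for every pair of rows in the batch."""
    n = len(cont)
    return [[lam * cosine(cont[i], cont[j])
             + (1 - lam) * categorical_match(cat[i], cat[j])
             for j in range(n)] for i in range(n)]

def positive_mask(S, threshold):
    """Pair (i, j) with i != j is a positive when S[i][j] >= threshold (assumed rule)."""
    n = len(S)
    return [[i != j and S[i][j] >= threshold for j in range(n)] for i in range(n)]

def multi_positive_loss(z, mask, temperature=0.1):
    """Assumed SupCon-style loss: mean negative log-probability of each anchor's positives."""
    n = len(z)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if mask[i][j]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: cosine(z[i], z[j]) / temperature for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        losses.append(-sum(logits[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0

# Toy batch (made-up values): standardized continuous attributes, categorical codes,
# and image embeddings z that an encoder would normally produce.
cont = [[1.0, 0.5], [0.9, 0.6], [-1.0, -0.4]]
cat = [["F", "never"], ["F", "never"], ["M", "current"]]
z = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]

S = tabular_similarity(cont, cat, lam=0.5)
mask = positive_mask(S, threshold=0.9)
loss = multi_positive_loss(z, mask)
```

In this toy batch the first two subjects have near-identical tabular rows, so they become each other's positives, while the third subject has no positive and is skipped as an anchor. Because the positives come from different subjects rather than from augmented views of one image, a clinically similar patient is no longer pushed away as a negative.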

Abstract

Contrastive learning methods in computer vision typically rely on augmented views of the same image or multimodal pretraining strategies that align paired modalities. However, these approaches often overlook semantic relationships between distinct instances, leading to false negatives when semantically similar samples are treated as negatives. This limitation is especially acute in medical imaging domains such as cardiology, where demographic and clinical attributes play a critical role in assessing disease risk and patient outcomes. We introduce a tabular-guided contrastive learning framework that leverages clinically relevant tabular data to identify patient-level similarities and construct more meaningful pairs, enabling semantically aligned representation learning without requiring joint embeddings across modalities. Additionally, we adapt the k-NN algorithm for zero-shot prediction to overcome the lack of zero-shot capability in unimodal representations. We demonstrate the strength of our methods using a large cohort of short-axis cardiac MR images and clinical attributes, where tabular data helps to more effectively distinguish between patient subgroups. Evaluation on downstream tasks, including fine-tuning, linear probing, and zero-shot prediction of coronary artery disease (CAD) and cardiac phenotypes, shows that incorporating tabular data guidance yields stronger visual representations than conventional methods that rely solely on image augmentation or combined image-tabular embeddings. Further, we show that our method can generalize to natural images by evaluating it on a car advertisement dataset. The code will be available on GitHub upon acceptance.
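The k-NN zero-shot idea can be sketched as follows: embed a query image with the frozen encoder, find its nearest neighbours among a labelled representative set, and aggregate their labels. This is an illustrative sketch under stated assumptions, not the paper's procedure: cosine similarity as the distance, majority vote for classification, a plain mean for regression, and the function names, labels, and values are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def nearest_indices(query, rep_embeddings, k):
    """Indices of the k representative embeddings most similar to the query."""
    ranked = sorted(range(len(rep_embeddings)),
                    key=lambda i: cosine(query, rep_embeddings[i]),
                    reverse=True)
    return ranked[:k]

def knn_classify(query, rep_embeddings, rep_labels, k=3):
    """Majority vote over the labels of the k nearest representatives."""
    votes = [rep_labels[i] for i in nearest_indices(query, rep_embeddings, k)]
    return Counter(votes).most_common(1)[0][0]

def knn_regress(query, rep_embeddings, rep_values, k=3):
    """Mean target value of the k nearest representatives."""
    idx = nearest_indices(query, rep_embeddings, k)
    return sum(rep_values[i] for i in idx) / len(idx)

# Toy representative set (made-up embeddings, labels, and phenotype values).
rep_embeddings = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [-0.1, 0.9]]
rep_labels = ["healthy", "healthy", "CAD", "CAD"]
rep_lvef = [60.0, 58.0, 40.0, 42.0]

query = [0.95, 0.1]  # embedding of an unseen image
predicted_label = knn_classify(query, rep_embeddings, rep_labels, k=2)
predicted_lvef = knn_regress(query, rep_embeddings, rep_lvef, k=2)
```

No text or tabular encoder is needed at prediction time: the representative set supplies the labels, which is what lets a purely visual (unimodal) embedding make predictions without task-specific training.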

Paper Structure

This paper contains 38 sections, 8 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Contrastive learning methods typically assume a one-to-one correspondence between positive pairs during training, an assumption present in both image augmentation and tabular supervision approaches (the latter similar to CLIP [radford2021learning], but with tabular data instead of text). However, this assumption can cause clinically similar patients to be treated as negative pairs, leading to false negatives.
  • Figure 2: Comparison of the proposed tabular guidance (TGV: Tables Guide Vision) approach against other contrastive learning approaches. Instead of relying on different views of the same image for training, as in image augmentation approaches, TGV defines pairs between different subjects in the batch based on their tabular similarity. Unlike tabular supervision, tabular guidance does not require a joint embedding space between images and tabular data.
  • Figure 3: Performance of the models under different amounts of training data on the LVEF prediction (left, lower is better) and multilabel CAD classification (right, higher is better) tasks. The performance is evaluated using fine-tuning (solid lines), linear probing (dashed lines), and zero-shot prediction (stars).
  • Figure 4: t-SNE visualization of the feature embeddings generated with TGV. Sex, LVEF, and LVEDV were included as attributes when calculating tabular similarity during training. Height was not included.
  • Figure 5: Performance of all evaluated methods on LVEF prediction and CAD classification using zero-shot prediction, and linear probing and fine-tuning under limited data regimes. The LVEF prediction result of BYOL [grill2020bootstrap] with fine-tuning at 1% is clipped; the value is 26.68. Results are reported for three evaluation strategies, applied consistently across all included methods: fine-tuning (solid lines), linear probing (dashed lines), and zero-shot prediction (stars).
  • ...and 2 more figures