VET-DINO: Learning Anatomical Understanding Through Multi-View Distillation in Veterinary Imaging

Andre Dourson, Kylie Taylor, Xiaoli Qiao, Michael Fitzke

TL;DR

VET-DINO learns anatomical representations from medical images with few labels by exploiting the fact that a radiographic study naturally contains multiple views. It extends self-supervised distillation (DINO/DINOv2) to real multi-view pairs: two images are drawn from each study and cropped for a student ViT, while the teacher, whose weights are an exponential moving average (EMA) of the student's, receives only global crops from a single view. On a dataset of 5 million radiographs from 668,000 canine studies, VET-DINO reports state-of-the-art performance on downstream tasks, improving both representation quality (k-NN) and supervised fine-tuning metrics (average precision, ROC AUC) over single-view and ImageNet-pretrained baselines. Analyses of attention maps and patch-embedding similarity provide evidence that the model learns view-invariant, anatomically meaningful representations. This suggests a pathway to 3D-aware understanding from 2D radiographs and supports domain-specific self-supervised learning for veterinary imaging.
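The crop routing and EMA teacher described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `route_crops` and `ema_update` are hypothetical, crops are represented by tags rather than image tensors, the crop counts (two global and ten local per view) are taken from the Figure 1 caption, and the momentum value is an assumed placeholder.

```python
import random


def route_crops(study_views, n_global=2, n_local=10, rng=None):
    """Pick two views from one study and decide which crops each network sees.

    Crops are represented as (view_id, kind, index) tags; real cropping and
    resizing are omitted. All names here are hypothetical.
    """
    rng = rng or random.Random()
    # Two distinct radiographic views from the same study.
    view_a, view_b = rng.sample(study_views, 2)

    # The student receives every global and local crop from both views.
    student = []
    for view in (view_a, view_b):
        student += [(view, "global", i) for i in range(n_global)]
        student += [(view, "local", i) for i in range(n_local)]

    # The teacher receives only the global crops of one randomly chosen view.
    teacher_view = rng.choice((view_a, view_b))
    teacher = [(teacher_view, "global", i) for i in range(n_global)]
    return student, teacher


def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Return teacher <- m * teacher + (1 - m) * student, per parameter.

    The momentum value is an assumption, not a setting reported in the text.
    """
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


if __name__ == "__main__":
    student, teacher = route_crops(["VD", "left_lateral", "right_lateral"])
    print(len(student), "student crops;", len(teacher), "teacher crops")
    print(ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9))
```

Because the student sees crops from both views while the teacher sees only one, part of the student's training signal comes from matching the teacher across different real projections of the same anatomy, rather than across synthetic augmentations of a single image.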

Abstract

Self-supervised learning has emerged as a powerful paradigm for training deep neural networks, particularly in medical imaging where labeled data is scarce. While current approaches typically rely on synthetic augmentations of single images, we propose VET-DINO, a framework that leverages a unique characteristic of medical imaging: the availability of multiple standardized views from the same study. Using a series of clinical veterinary radiographs from the same patient study, we enable models to learn view-invariant anatomical structures and develop an implied 3D understanding from 2D projections. We demonstrate our approach on a dataset of 5 million veterinary radiographs from 668,000 canine studies. Through extensive experimentation, including view synthesis and downstream task performance, we show that learning from real multi-view pairs leads to superior anatomical understanding compared to purely synthetic augmentations. VET-DINO achieves state-of-the-art performance on various veterinary imaging tasks. Our work establishes a new paradigm for self-supervised learning in medical imaging that leverages domain-specific properties rather than merely adapting natural image techniques.

Paper Structure

This paper contains 27 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Multi-view VET-DINO architecture. Two radiographic views are randomly selected from a single canine study (a set of radiographs from the same imaging session). From each view, two global crops and ten local crops are extracted and resized to 224x224 pixels and 98x98 pixels, respectively. All crops are passed to the student Vision Transformer (ViT) network, while only the global crops from one randomly chosen view are passed to the teacher ViT network. The student network is trained to match the output of the teacher network, which receives a more "global" perspective. The teacher's weights are an exponential moving average (EMA) of the student's weights. This process enables VET-DINO to learn view-invariant anatomical representations without manual annotations.
  • Figure 2: Samples of radiographs from the validation dataset.
  • Figure 3: Ventrodorsal and left lateral radiographic projections of a canine patient acquired during the same visit. Superimposed on these images are attention maps derived from the attention heads of the final block of the Multi-view VET-DINO model, illustrating the model's focus on specific anatomical regions. Notably, distinct attention heads appear to attend independently to the skeletal system (green), the soft tissue/muscular system (red), and the gastrointestinal system (yellow). This observation suggests that the model exhibits view-invariant, although imperfect, attention to anatomical structures, highlighting its capacity to learn consistent representations of anatomical features across different radiographic views.
  • Figure 4: Comparison of the Multi-view VET-DINO model with a Single-view VET-DINO model and the original DINOv2 on identical radiographic input. The Multi-view VET-DINO model attends more clearly to relevant anatomical structures in canine patients: the skeletal system (green), soft tissue (red), and intestinal system (yellow). This observation supports the hypothesis that fine-tuning with multi-view radiographic studies enhances the model's learning capacity and facilitates the development of more robust and comprehensive anatomical representations.
  • Figure 5: Visualization of cosine similarity between patch embeddings for the multi-view VET-DINO model, the single-view VET-DINO model, and untuned DINOv2. The anchor patch (red box) and comparison image originate from the same patient and radiographic session. The top-5 most similar patches are highlighted, demonstrating superior view-invariant anatomical feature identification in the multi-view model compared with the single-view model and the original DINOv2.
  • ...and 3 more figures
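
The patch-similarity comparison described in the Figure 5 caption can be sketched as a top-k cosine-similarity lookup. This is a minimal illustration under stated assumptions, not the authors' code: the helper names are hypothetical, and patch embeddings are assumed to be plain lists of floats that have already been extracted from a model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar_patches(anchor, patches, k=5):
    """Return (patch_index, similarity) for the k patches closest to the anchor.

    `anchor` is the embedding of the anchor patch from one view; `patches`
    holds the patch embeddings of the comparison image (hypothetical inputs).
    """
    scored = [(cosine_similarity(anchor, p), i) for i, p in enumerate(patches)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, s) for s, i in scored[:k]]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    comparison = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0],
                  [2.0, 0.1], [0.5, 0.5], [1.0, -0.2]]
    for index, score in top_k_similar_patches(anchor, comparison):
        print(index, round(score, 3))
```

Mapping the returned indices back to patch positions in the comparison image gives the highlighted regions; a view-invariant model would place them on the same anatomical structure as the anchor.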