VET-DINO: Learning Anatomical Understanding Through Multi-View Distillation in Veterinary Imaging

Andre Dourson; Kylie Taylor; Xiaoli Qiao; Michael Fitzke

TL;DR

VET-DINO addresses the challenge of learning anatomical representations in medical imaging with limited labels by exploiting the multi-view nature of radiographic studies. It extends self-supervised distillation (DINO/DINOv2) to learn from real multi-view pairs within a study: crops drawn from two images of the same study are fed to a student ViT, while the teacher, updated as an exponential moving average (EMA) target, receives global crops from a single view. On a dataset of 5 million radiographs from 668,000 canine studies, VET-DINO reports state-of-the-art performance on downstream tasks, improving both representation quality (k-NN) and supervised fine-tuning metrics (average precision, ROC AUC) over single-view and ImageNet-pretrained baselines. Analyses of attention maps and patch embedding similarity provide evidence that the model learns view-invariant, anatomically meaningful representations. This suggests a pathway toward 3D-aware understanding from 2D radiographs and toward domain-specific self-supervised learning for veterinary imaging.
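The student/teacher setup summarized above can be illustrated with a minimal NumPy sketch of a DINO-style distillation loss and EMA update. This is not the authors' implementation: the function names, the temperatures, the momentum value, and the choice to average over every teacher/student crop pair are assumptions made for illustration, and random arrays stand in for ViT head outputs.

```python
import numpy as np

def softmax(logits, temperature):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between centered, sharpened teacher targets and student predictions.

    student_logits: (n_student_crops, batch, dim), crops drawn from both views of a study
    teacher_logits: (n_teacher_crops, batch, dim), global crops from a single view
    Temperatures are placeholder values, not settings reported for VET-DINO.
    """
    targets = softmax(teacher_logits - center, teacher_temp)
    z = student_logits / student_temp
    z = z - z.max(axis=-1, keepdims=True)
    log_preds = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Simplification: average over every (teacher crop, student crop) pair.
    pairwise = -(targets[:, None] * log_preds[None, :]).sum(axis=-1)
    return float(pairwise.mean())

def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter toward the student by an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Random stand-ins for network outputs (hypothetical shapes).
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 8, 16))
teacher_logits = rng.normal(size=(2, 8, 16))
center = teacher_logits.mean(axis=(0, 1))
loss = distillation_loss(student_logits, teacher_logits, center)
new_teacher = ema_update([np.ones(3)], [np.zeros(3)], momentum=0.9)
```

In an actual training loop, gradients would flow only through the student, and the teacher would be refreshed with `ema_update` after each optimizer step.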

Abstract

Self-supervised learning has emerged as a powerful paradigm for training deep neural networks, particularly in medical imaging where labeled data is scarce. While current approaches typically rely on synthetic augmentations of single images, we propose VET-DINO, a framework that leverages a unique characteristic of medical imaging: the availability of multiple standardized views from the same study. Using a series of clinical veterinary radiographs from the same patient study, we enable models to learn view-invariant anatomical structures and develop an implied 3D understanding from 2D projections. We demonstrate our approach on a dataset of 5 million veterinary radiographs from 668,000 canine studies. Through extensive experimentation, including view synthesis and downstream task performance, we show that learning from real multi-view pairs leads to superior anatomical understanding compared to purely synthetic augmentations. VET-DINO achieves state-of-the-art performance on various veterinary imaging tasks. Our work establishes a new paradigm for self-supervised learning in medical imaging that leverages domain-specific properties rather than merely adapting natural image techniques.
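As a rough illustration of what learning from real multi-view pairs implies for data loading, the sketch below groups radiographs by study and samples two distinct images from one study. The record layout, the file names, and the rule of skipping studies with fewer than two images are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict

def group_by_study(records):
    """Map study_id -> list of image paths. `records` is an iterable of (study_id, path)."""
    studies = defaultdict(list)
    for study_id, path in records:
        studies[study_id].append(path)
    return dict(studies)

def sample_view_pair(studies, rng):
    """Pick a study with at least two radiographs and return two distinct images from it."""
    eligible = [sid for sid, paths in studies.items() if len(paths) >= 2]
    study_id = rng.choice(eligible)
    view_a, view_b = rng.sample(studies[study_id], 2)
    return study_id, view_a, view_b

# Hypothetical records: (study identifier, image file name).
records = [
    ("study_001", "lateral_thorax.png"),
    ("study_001", "vd_thorax.png"),
    ("study_002", "lateral_abdomen.png"),
    ("study_003", "lateral_pelvis.png"),
    ("study_003", "vd_pelvis.png"),
    ("study_003", "lateral_stifle.png"),
]
studies = group_by_study(records)
study_id, view_a, view_b = sample_view_pair(studies, random.Random(0))
```

Each sampled pair would then be cropped and augmented before being passed to the student and teacher networks.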
