DINO-MX: A Modular & Flexible Framework for Self-Supervised Learning
Mahmut Selman Gokmen, Cody Bumgardner
TL;DR
DINO-MX is a modular, configuration-driven framework for self-supervised learning, evaluated here on medical imaging, that unifies DINO, DINOv2, and DINOv3 within a Hugging Face-compatible ecosystem. It emphasizes domain adaptation, resource efficiency through LoRA and layer freezing, and interpretable attention analyses, while supporting diverse medical data types and cross-training across SSL paradigms. Empirical results on MedMNIST and CT calcification tasks show competitive performance at substantially lower computational cost, aided by label-guided augmentation and attention-based localization that remove the need for extra detection heads. The work offers a reproducible, scalable foundation for benchmarking and deploying self-supervised vision models in medical contexts, with future work aimed at integrating vision encoders with large language models for richer clinical reasoning.
Abstract
Vision Foundation Models (VFMs) have advanced representation learning through self-supervised methods. However, existing training pipelines are often inflexible, domain-specific, or computationally expensive, which limits their usability across different domains and resource settings. DINO-MX is a modular and extensible training framework that combines the core principles of DINO, DINOv2, and DINOv3 within a unified configuration-driven system. It supports a variety of transformer-based architectures and is fully compatible with the Hugging Face ecosystem. The framework includes multiple training strategies such as low-rank adaptation (LoRA), layer freezing, and knowledge distillation, along with support for distributed training through both Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP). DINO-MX is designed to work with both natural and specialized data types, including single- and multi-channel images. Experimental results on diverse datasets show that DINO-MX achieves competitive performance while significantly reducing computational costs. Additionally, it offers interpretability tools and a label-guided data augmentation method that improves attention-based localization without the need for extra detection or segmentation heads. DINO-MX provides a reproducible and scalable foundation for developing, adapting, and benchmarking self-supervised vision models across a range of research and real-world applications.
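To give a rough sense of why LoRA and layer freezing reduce training cost, the sketch below counts trainable attention weights under a small configuration object. It is an illustration only: `AdapterConfig` and its field names are hypothetical and are not DINO-MX's configuration schema. The dimensions assume a standard ViT-Base backbone (width 768, 12 blocks), and adapting only two projections per block (for example query and value) is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterConfig:
    """Hypothetical config; field names are illustrative, not DINO-MX's schema."""
    hidden_dim: int = 768    # assumed ViT-Base embedding width d
    num_layers: int = 12     # assumed ViT-Base transformer blocks
    lora_rank: int = 8       # rank r of each low-rank update
    lora_targets: int = 2    # adapted projections per block (e.g. query, value)
    frozen_layers: int = 0   # leading blocks that receive no adapters


def full_attention_params(cfg: AdapterConfig) -> int:
    """Weights in the four d x d attention projections per block, biases ignored."""
    return cfg.num_layers * 4 * cfg.hidden_dim * cfg.hidden_dim


def lora_trainable_params(cfg: AdapterConfig) -> int:
    """Each adapted d x d projection adds A (r x d) and B (d x r): 2*d*r weights."""
    active_layers = cfg.num_layers - cfg.frozen_layers
    per_projection = 2 * cfg.hidden_dim * cfg.lora_rank
    return active_layers * cfg.lora_targets * per_projection


base = AdapterConfig()
half_frozen = AdapterConfig(frozen_layers=6)

print(full_attention_params(base))         # 28311552
print(lora_trainable_params(base))         # 294912
print(lora_trainable_params(half_frozen))  # 147456
```

Under these assumptions the adapters hold r/d = 8/768, or about 1%, of the attention weights that full fine-tuning would update, and leaving the first six blocks without adapters halves that again. The count covers attention projections only; MLP, embedding, and head parameters are left out to keep the arithmetic short.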
