DINO-MX: A Modular & Flexible Framework for Self-Supervised Learning

Mahmut Selman Gokmen, Cody Bumgardner

TL;DR

DINO-MX introduces a modular, configuration-driven framework for self-supervised learning in medical imaging that unifies DINOv1, DINOv2, and related approaches within a Hugging Face-friendly ecosystem. It emphasizes domain adaptation, resource efficiency through LoRA and layer freezing, and interpretable attention analyses, while supporting diverse medical data types and cross-training across SSL paradigms. Empirical results on MedMNIST and CT calcification tasks demonstrate competitive performance with substantially reduced computational demands, aided by label-guided augmentation and attention-based localization that remove the need for extra detection heads. The work offers a reproducible, scalable foundation for benchmarking and deploying self-supervised vision models in medical contexts, with future directions toward integrating vision encoders with Large Language Models for richer clinical reasoning.

Abstract

Vision Foundation Models (VFMs) have advanced representation learning through self-supervised methods. However, existing training pipelines are often inflexible, domain-specific, or computationally expensive, which limits their usability across different domains and resource settings. DINO-MX is a modular and extensible training framework that combines the core principles of DINO, DINOv2 and DINOv3 within a unified configuration-driven system. It supports a variety of transformer-based architectures and is fully compatible with the Hugging Face ecosystem. The framework includes multiple training strategies such as low-rank adaptation (LoRA), layer freezing, and knowledge distillation, along with support for distributed training through both Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP). DINO-MX is designed to work with both natural and specialized data types, including single- and multi-channel images. Experimental results on diverse datasets show that DINO-MX achieves competitive performance while significantly reducing computational costs. Additionally, it offers interpretability tools and a label-guided data augmentation method that improves attention-based localization without the need for extra detection or segmentation heads. DINO-MX provides a reproducible and scalable foundation for developing, adapting, and benchmarking self-supervised vision models across a range of research and real-world applications.
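
To make the resource-efficiency claim behind low-rank adaptation concrete, the following is a minimal, dependency-free sketch of a LoRA-style linear layer. It is not the DINO-MX implementation or API: the class name `LoRALinear`, its parameters, and the 768-dimensional layer size (a common ViT-Base width) are assumptions chosen for illustration. The pretrained weight `W` stays frozen, and only the two small factors `A` and `B` would be trained.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical LoRA-style layer: y = W x + (alpha / rank) * B (A x)."""

    def __init__(self, d_in, d_out, rank, alpha, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weight (a random stand-in for illustration).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable low-rank factors: A is (rank x d_in), B is (d_out x rank).
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, so the adapted layer initially equals the frozen one.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        col = [[v] for v in x]  # column vector
        base = matmul(self.W, col)
        update = matmul(self.B, matmul(self.A, col))
        return [b[0] + self.scale * u[0] for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=768, d_out=768, rank=8, alpha=16)
ratio = layer.trainable_params() / layer.frozen_params()
print(f"trainable: {layer.trainable_params()}, frozen: {layer.frozen_params()}, ratio: {ratio:.2%}")
```

Under these assumed sizes, the two factors hold 12,288 values against 589,824 in the frozen weight, which is about 2% of the layer. Layer freezing follows the same idea at a coarser granularity: whole blocks are excluded from the optimizer rather than being given a low-rank update.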

Paper Structure

This paper contains 37 sections, 3 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Comparison of pixel value distributions between (a) a CT scan image and (b) a natural image. The CT distribution is defined in Hounsfield Units (HU), while natural images are represented in RGB channels with pixel values in the [0, 255] range.
  • Figure 2: Impact of dataset size when varying data augmentations. Results of ViT-L on linear evaluation benchmarks. Cropping without resizing ('Crop') reaches very high performance, comparable to full augmentation ('Original'), on a wide variety of benchmarks when the dataset size is large enough.
  • Figure 3: General representation of the DINO-MX framework.
  • Figure 4: The simplified, high-level configuration for parallelization strategies in the DINO-MX framework.
  • Figure 5: Example representation of the knowledge distillation system in the DINO-MX framework.
  • ...and 5 more figures