Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano

TL;DR

Franca tackles the reproducibility gap in vision foundation models by delivering a fully open-data, open-weight, open-code model pretrained on public data. It introduces Matryoshka-based nested multi-head clustering for scalable, multi-granular representations, along with CyclicMask and RASA to reduce positional biases and emphasize semantic content. Across classification, segmentation, robustness, and 3D understanding benchmarks, Franca matches or exceeds proprietary-model performance without distillation or private data, demonstrating strong generalization and openness. This work sets a new standard for transparent, high-performance vision foundations with demonstrated applicability to dense prediction and 3D tasks.

Abstract

We present Franca (pronounced Fran-ka; "free one"), the first fully open-source (data, code, weights) vision foundation model that matches, and in many cases surpasses, the performance of state-of-the-art proprietary models such as DINOv2, CLIP, and SigLIPv2. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond the model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
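
The nested Matryoshka design can be illustrated with a minimal sketch. It follows the description in the Figure 1 caption, where features of dimension $d$ are sliced down to $d/16$ and paired with cluster counts from $c$ down to $c/16$. Several details here are assumptions rather than the authors' implementation: the halving schedule between levels, the choice of leading coordinates for each slice, the helper names `nested_slices` and `slice_features`, and the example values `d=768` and `c=65536`.

```python
def nested_slices(d, c, levels=5):
    """Return (slice_dim, cluster_count) pairs, halving both at each level.

    Assumption: the schedule is d, d/2, d/4, d/8, d/16 (and likewise for c).
    """
    plan = []
    for j in range(levels):
        factor = 2 ** j
        if d % factor or c % factor:
            raise ValueError("d and c must be divisible by 2**(levels - 1)")
        plan.append((d // factor, c // factor))
    return plan


def slice_features(z, plan):
    """Take the leading slice_dim coordinates of z for each level.

    Assumption: slices are nested prefixes of the full feature vector.
    """
    return [z[:dim] for dim, _ in plan]


# Illustrative values only; not taken from the paper.
plan = nested_slices(d=768, c=65536)
z = [float(i) for i in range(768)]
slices = slice_features(z, plan)

for (dim, clusters), s in zip(plan, slices):
    print(f"slice of {len(s)} dims -> clustering head with {clusters} clusters")
```

Because every slice is a prefix of the same vector, the heads share the encoder output and no additional backbone parameters are needed in this sketch; only the per-slice projection and clustering heads (not modeled here) would differ.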

Paper Structure

This paper contains 30 sections, 5 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Overview of Franca. Top-left: We learn efficient Matryoshka-style (Kusupati et al., 2022) visual representations using a multi-head clustering projection head. The encoder produces features $z \in \mathbb{R}^d$, which are sliced into progressively smaller subsets of dimensions $d, \dots, d/8, d/16$. Each slice passes through a projection head and a corresponding clustering head with cluster counts $c, \dots, c/8, c/16$, inducing a coarse-to-fine hierarchy of semantic abstraction. Top-right: Unlike prior approaches trained on curated academic datasets, e.g., LVD-142M in DINOv2, or proprietary data like WebLI in SigLIPv2, Franca is trained on open-source, internet-scale, uncurated data. Bottom: Despite this, it generalizes well across model scales and achieves strong performance on diverse downstream tasks, including in-context learning (Balazevic et al., 2023), out-of-distribution detection (Yang et al., 2022), and 3D understanding (Chen et al., 2025).
  • Figure 2: Pretraining ablation of Franca. Starting from a ViT-B/14 pretrained on ImageNet-21K, we show the impact of each proposed component. The inner bar represents in-context segmentation performance on the Hummingbird benchmark (Balazevic et al., 2023), while the outer bar shows linear probing accuracy on ImageNet-1K (Russakovsky et al., 2015). Each addition, i.e., CyclicMask, Matryoshka representations, RASA, and high-resolution finetuning, results in consistent improvements.
  • Figure 3: PCA visualizations across Matryoshka slices. We show the first three PCA components for different feature slices $m_j$ of Franca and DINOv2. Despite Franca being trained only up to $\text{dim}/16$, it maintains coherent part structure even at smaller feature dimensions compared with DINOv2.
  • Figure 4: k-NN classification accuracy on ImageNet-v2 at varying embedding slice levels using a ViT-L backbone. Franca consistently outperforms DINOv2 across all subspace dimensions, maintaining high performance even under strong compression ($\text{dim}/64$). Note that DINOv2 was not trained with sliced dimensions and its features are uniformly distributed across the full embedding space.
  • Figure 5: Masking strategies used in masked image modeling. Compared to Random (a), Block (b), and Inverse (c) masking, our CyclicMask (d) circularly shifts the visible region across spatial axes, preventing the model from being biased toward specific spatial locations.
  • ...and 7 more figures
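
The CyclicMask idea from the Figure 5 caption, a visible region that is circularly shifted across both spatial axes, can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the rectangular shape of the visible region, the uniform sampling of shifts, and the helper names `block_mask`, `cyclic_shift`, and `cyclic_mask` are all assumptions.

```python
import random


def block_mask(height, width, top, left, block_h, block_w):
    """Grid of booleans; True marks a visible patch inside one rectangle."""
    return [
        [top <= r < top + block_h and left <= c < left + block_w
         for c in range(width)]
        for r in range(height)
    ]


def cyclic_shift(mask, shift_r, shift_c):
    """Circularly shift a 2D mask along both axes with wrap-around."""
    height, width = len(mask), len(mask[0])
    return [
        [mask[(r - shift_r) % height][(c - shift_c) % width]
         for c in range(width)]
        for r in range(height)
    ]


def cyclic_mask(height, width, block_h, block_w, rng=random):
    """Place a visible block at the origin, then shift it by random offsets.

    Assumption: offsets are drawn uniformly over the patch grid.
    """
    base = block_mask(height, width, 0, 0, block_h, block_w)
    return cyclic_shift(base, rng.randrange(height), rng.randrange(width))


# A 2x2 visible block on a 4x4 patch grid, shifted so that it wraps around.
wrapped = cyclic_shift(block_mask(4, 4, 0, 0, 2, 2), 3, 3)
sampled = cyclic_mask(4, 4, 2, 2, random.Random(0))
```

The wrap-around keeps the number of visible patches constant for every shift, so no grid location is systematically favored, which matches the caption's stated goal of avoiding bias toward specific spatial locations.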