Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM

Junyuan Mao; Qiankun Li; Linghao Meng; Zhicheng He; Xinliang Zhou; Kun Wang; Yang Liu; Yueming Jin

TL;DR

Granulon is a novel DINOv3-based MLLM with adaptive granularity augmentation that enables unified "pixel-to-fine-to-coarse" reasoning within a single forward pass.

Abstract

Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified "pixel-to-fine-to-coarse" reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
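To make the idea of granularity-guided pooling concrete, here is a minimal sketch, not the authors' implementation. It assumes the Controller emits a single scalar in [0, 1] and that this scalar selects an average-pooling window over a square grid of patch tokens; the function name `granularity_guided_pool`, the `max_window` parameter, and the linear scalar-to-window mapping are all hypothetical. The relation-aware clustering step mentioned in the abstract is not modeled here.

```python
import numpy as np

def granularity_guided_pool(tokens, grid, granularity, max_window=4):
    """Average-pool a (grid*grid, dim) token array with a granularity-set window.

    Assumption: `granularity` is a scalar in [0, 1], where 0 keeps every
    patch token (finest) and 1 pools with the largest window (coarsest).
    """
    h = w = grid
    dim = tokens.shape[1]
    # Map the scalar to an integer window size in {1, ..., max_window}.
    window = 1 + int(round(granularity * (max_window - 1)))
    # Crop so the grid divides evenly by the window (a simplification).
    hc, wc = (h // window) * window, (w // window) * window
    x = tokens.reshape(h, w, dim)[:hc, :wc]
    # Split the grid into window x window blocks and average inside each block.
    x = x.reshape(hc // window, window, wc // window, window, dim)
    return x.mean(axis=(1, 3)).reshape(-1, dim)
```

Under these assumptions, a query with a narrow semantic scope would map to a small granularity value and keep many tokens, while a broad, scene-level query would map to a large value and keep few.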

Paper Structure (43 sections, 11 equations, 24 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Semantic-level Visual Encoder.
Pixel-level Visual Encoder.
Multimodal Large Language Models.
Methodology
Overview
Notations.
Target.
Training objective.
Overall Pipeline.
Granulon
Text-conditioned Granularity Controller
Adaptive Token Aggregation (AdaTA)
(a) Granularity-guided Pooling.
...and 28 more sections

Figures (24)

Figure 1: CLIP tends to emphasize global semantics and DINOv3 excels in pixel-level understanding. Our Granulon unleashes the potential of DINOv3 with adaptive granularity augmentation, achieving consistent semantic alignment across different levels of detail and delivering superior multimodal reasoning performance.
Figure 2: Overview of $\textbf{Granulon}$. (a) The architecture of the image processor. (b) The detailed process of AdaTA that generates semantic tokens for multi-granularity augmentation. (c) The training and running of Controller.
Figure 3: Distribution of accuracy and granularity obtained from reasoning outputs. We summarize results across 120 samples, where the granularity and accuracy scores of each sentence are collected. Lower granularity score means finer-grained outputs.
Figure 4: Hallucination rate of various visual encoders.
Figure 5: Comparison of sentence-level granularity and hallucination scores across different visual encoders. Sentence index means the n-th sentence generated by LLMs.
...and 19 more figures
