DenseMLLM: Standard Multimodal LLMs are Intrinsic Dense Predictors

Yi Li, Hongze Shen, Lexiang Tang, Xin Li, Xinpeng Ding, Yinsong Liu, Deqiang Jiang, Xing Sun, Xiaomeng Li

TL;DR

DenseMLLM builds on the standard MLLM architecture and adds a novel vision token supervision strategy for multiple labels and tasks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization.

Paper Structure

This paper contains 21 sections, 5 equations, 9 figures, and 9 tables.

Figures (9)

  • Figure 1: Motivation. (a): PCA visualization of the hidden states reveals that the vision tokens of the proposed DenseMLLM intrinsically encode fine-grained details. Thus, DenseMLLM achieves high-quality dense predictions (segmentation and depth) directly from vision tokens without task-specific decoders. (b): A single vision token typically represents multiple vocabulary IDs (labels), especially in multi-task scenarios, in contrast to text tokens, which have only a single label. (c): These histograms indicate that vision tokens frequently have multiple labels across tasks. This motivates a new supervision strategy that aligns vision-text representations effectively in multi-label, multi-task settings.
  • Figure 2: The framework of DenseMLLM. (a) Overview: Using a standard MLLM architecture (Vision Encoder, Projector, LLM), our model outputs both text responses and dense predictions without specialized heads. (b) Training with NTP-M: To handle vision tokens that carry multiple semantics (e.g., objects and depth), we propose NTP-M. This multi-label strategy supervises vision tokens against a vocabulary using the proposed relevant negative sampling strategy, extending beyond standard single-label NTP. (c) Inference Comparison: Unlike methods relying on external decoders (e.g., SAM-based [rasheed2024glamm]), retrieval tokens (e.g., UFO [tang2025ufo]), or polygons [visionllm], DenseMLLM requires no additions. We obtain dense predictions by indexing vision logits with text token IDs and taking the argmax, which keeps the architecture simple.
  • Figure 3: Qualitative Results. Visualization of three dense prediction tasks (semantic segmentation on ADE20k, depth estimation on NYUv2, and referring expression segmentation on RefCOCO).
  • Figure 4: Visualization of Vision Token Representations. This figure compares Principal Component Analysis (PCA) visualizations [oquab2023dinov2] of the last-layer hidden states of vision tokens. We contrast DenseMLLM-4B (with vision token supervision) against three models without such supervision: DenseMLLM-4B w/o, Qwen2.5-VL [qwen2.5], and Qwen3-VL [Qwen3-VL]. With vision token supervision, our model exhibits markedly better feature separation and visualization quality than VLMs lacking this supervision.
  • Figure 5: Qualitative example of semantic segmentation on the ADE20k dataset.
  • ...and 4 more figures
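
The inference step described in the Figure 2 caption (indexing vision logits with text token IDs, then taking the argmax) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `dense_predict`, the toy vocabulary, and the token IDs are hypothetical, and the sketch assumes each class name maps to a single token ID and that each vision token already has logits over the vocabulary. Upsampling the token grid to pixel resolution is not shown.

```python
def dense_predict(vision_logits, class_token_ids, grid_hw):
    """Assign one class index to each vision token (illustrative only).

    vision_logits:   list of N rows, each a list of V logits over the vocabulary
    class_token_ids: vocabulary IDs of the candidate class names (assumed one ID per class)
    grid_hw:         (height, width) of the vision token grid, with height * width == N
    """
    height, width = grid_hw
    if len(vision_logits) != height * width:
        raise ValueError("number of vision tokens must equal height * width")

    labels = []
    for logits in vision_logits:
        # Index the vocabulary logits with the candidate text token IDs only.
        scores = [logits[token_id] for token_id in class_token_ids]
        # Argmax over the candidates gives the predicted class index.
        labels.append(max(range(len(scores)), key=scores.__getitem__))

    # Reshape the flat label list back into the token grid.
    return [labels[r * width:(r + 1) * width] for r in range(height)]


# Toy example: 6-entry vocabulary, three hypothetical class tokens, 2x2 token grid.
class_names = ["sky", "tree", "road"]
class_token_ids = [2, 4, 5]  # hypothetical vocabulary IDs
vision_logits = [
    [0.1, 0.0, 2.0, 0.3, 0.5, 0.1],
    [0.0, 3.0, 0.2, 0.1, 1.5, 0.4],  # ID 1 scores highest but is not a candidate
    [0.0, 0.0, 0.1, 0.0, 0.2, 0.9],
    [0.5, 0.1, 0.3, 0.2, 1.1, 1.0],
]
mask = dense_predict(vision_logits, class_token_ids, (2, 2))
named_mask = [[class_names[i] for i in row] for row in mask]
print(named_mask)
```

The second token shows why the logits are indexed before the argmax: the highest-scoring vocabulary entry overall is not one of the candidate class tokens, so it is ignored.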