REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding

Yan Tai, Luhao Zhu, Yunan Ding, Yiying Dong, Guangtao Zhai, Xiaohong Liu, Guodong Guo

TL;DR

REF-VLM introduces a Triplet-Based Referring Paradigm (TRP) and a large-scale Visual-Task Instruction Following Dataset (VT-Instruct) to achieve unified end-to-end visual decoding across diverse tasks. By decoupling concepts, decoding types, and targets, TRP enables multi-granularity and multi-task referencing within a single framework. The architecture employs Mask-Guided Aggregation for parameter-free fusion of visual prompts, a Latent Embeddings Router, and Parallel Grouped Hungarian Matching to support joint training across multiple visual unit decoders. Experiments show that REF-VLM outperforms several MLLMs on visual understanding, referring expression tasks, grounded generation, and open-vocabulary identification, indicating cross-task adaptability and scalability for real-world vision-language applications.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Meanwhile, current MLLMs that use latent embeddings for visual task decoding generally show limited adaptability to both multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions in visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual-Task Instruction Following Dataset (VT-Instruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units such as box, keypoint, depth, and mask. The combination of different visual prompts and visual units yields a wide variety of task types, significantly expanding the applicability of REF-VLM. Both qualitative and quantitative experiments demonstrate that REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo will be publicly available.
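
To make the triplet idea concrete, here is a minimal sketch of how a response containing (concept, decoding type, targets) triplets might be parsed. The delimiter syntax `<c>...</c>(<t>...</t>[i][j])` is invented purely for illustration, and the names `Triplet` and `parse_triplets` are hypothetical. The abstract states only that TRP uses symbolic delimiters and does not give the concrete tokens.

```python
import re
from dataclasses import dataclass

# Hypothetical delimiter syntax, invented for illustration only:
#   <c>concept</c>(<t>decoding type</t>[ref][ref]...)
# The real TRP tokens are not specified in the abstract.
TRIPLET = re.compile(r"<c>(.*?)</c>\(<t>(.*?)</t>((?:\[\d+\])+)\)")


@dataclass(frozen=True)
class Triplet:
    concept: str          # what is referred to, e.g. "dogs"
    decoding_type: str    # which visual unit to decode, e.g. "box"
    targets: tuple        # indices of the individual decoded instances


def parse_triplets(text):
    """Extract every triplet from a model response string."""
    triplets = []
    for concept, dtype, refs in TRIPLET.findall(text):
        indices = tuple(int(i) for i in re.findall(r"\[(\d+)\]", refs))
        triplets.append(Triplet(concept.strip(), dtype.strip(), indices))
    return triplets


response = "Two <c>dogs</c>(<t>box</t>[0][1]) chase a <c>ball</c>(<t>mask</t>[0])."
for triplet in parse_triplets(response):
    print(triplet)
```

Because each triplet carries its own decoding type and target indices, a downstream consumer can tell which phrase maps to which visual unit without guessing from word order, which is the parsability benefit the abstract describes.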

Paper Structure

This paper contains 14 sections, 8 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Comparison of Visual Unit Decoding Methods. Benefiting from the Triplet-Based Referring Paradigm, REF-VLM can adapt to more complex granularity scenarios and visual decoding tasks, enhancing the interpretability and accuracy of the MLLM's responses.
  • Figure 2: Examples from the VT-Instruct Dataset Built with the Automated Data Construction Pipeline. Our VT-Instruct dataset contains seven distinct downstream tasks, including Visual Understanding, Referring Expression, Interactive Grounding, Grounded Conversation Generation, Open-Vocabulary Identification, and Depth Estimation.
  • Figure 3: The Framework of REF-VLM. REF-VLM employs dual-architecture visual encoders to jointly encode images into a feature pyramid, enhancing visual unit decoder performance. Additionally, visual prompts are fused with global features and share a projector, enabling parameter-free encoding of image interactions. Training samples adhere to the Triplet-Based Referring Paradigm, ensuring one-to-one mapping between REF-VLM's latent embeddings and decoding targets.
  • Figure 4: Architecture of Visual Unit Decoders. We propose a Latent Embeddings Router to facilitate unified multi-task training in REF-VLM, and enhance the Hungarian matching algorithm for the TRP-based one-to-one referring decoding scheme.
  • Figure 5: Comparison of Parameter Counts. We compare PixelLM [xu2024pixel], GLaMM [rasheed2024glamm], VisionLLMv2 [visionllmv2], and our REF-VLM in terms of the parameter count of the backbone used for feature extraction in the visual decoder module.
  • ...and 1 more figure
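
The Figure 4 caption refers to a Hungarian matching step adapted for one-to-one referring decoding. As a rough illustration of matching within groups, the sketch below solves a separate one-to-one assignment for each group of predictions and targets. Grouping by decoding type is an assumption, the function names are hypothetical, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimum on small square cost matrices but does not scale). The parallel execution implied by the method's name is not modeled.

```python
from itertools import permutations


def match_group(cost):
    """Minimum-cost one-to-one assignment for a small square cost matrix.

    cost[i][j] is the cost of pairing prediction i with target j.
    Brute force over permutations: a stand-in for the Hungarian algorithm.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(enumerate(best))  # [(prediction index, target index), ...]


def grouped_match(costs_by_group):
    """Solve each group independently (grouping key is an assumption)."""
    return {group: match_group(cost) for group, cost in costs_by_group.items()}


# Toy costs: two box predictions vs. two box targets, one mask vs. one mask.
costs = {
    "box": [[0.9, 0.1], [0.2, 0.8]],
    "mask": [[0.3]],
}
print(grouped_match(costs))
```

Solving each group on its own keeps a box prediction from ever being paired with a mask target, which is one plausible reading of why a grouped variant suits joint training across several visual unit decoders.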