Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao, Junyi Pan, Yuan Liu, Xiaofen Xing, Chong Sun, Chen Li, Nancy F. Chen, Shuicheng Yan, Xulei Yang, Xun Xu
TL;DR
This work introduces Patch-as-Decodable Token (PaDT), a unified framework for multimodal large language models that directly generates textual and visual outputs through Visual Reference Tokens (VRTs). A Dynamic Embedding Module expands the model's codebook on a per-image basis, and a lightweight PaDT Decoder translates predicted VRTs into diverse outputs such as bounding boxes and masks, enabling dense vision tasks within an LLM framework. The authors demonstrate state-of-the-art results across fine-grained perception and understanding tasks (REC, RES, OVD, RIC) with both 3B and 7B models, and show strong multi-task scalability (PaDT Pro). The approach addresses misalignment and format inconsistency in prior methods by grounding visual predictions in a coherent, token-based space closely aligned with the LLM. The work also provides robust training techniques and ablation studies, and releases code for reproducibility.
Abstract
Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and rules out dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), which are derived from the visual patch embeddings of query images and interleaved seamlessly with the LLM's output text tokens. A lightweight decoder then transforms the LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks show that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLMs. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.
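To make the idea of a dynamically expanded embedding table concrete, the following is a minimal, illustrative sketch and not the released PaDT implementation. It assumes that the per-image table is simply the text embedding rows followed by one row per visual patch, and that the next token is chosen by a dot product between the hidden state and each row. The function names (`expand_embedding_table`, `decode_token`) and the toy vectors are hypothetical; the actual Dynamic Embedding Module may project or otherwise transform patch features before they enter the table.

```python
from typing import List, Tuple

Vector = List[float]


def expand_embedding_table(text_table: List[Vector],
                           patch_embeddings: List[Vector]) -> List[Vector]:
    """Build a per-image table: text rows first, then one row per patch.

    Assumption: patch embeddings already share the text embedding width.
    """
    dim = len(text_table[0])
    for row in patch_embeddings:
        if len(row) != dim:
            raise ValueError("patch embeddings must match the text embedding width")
    # Concatenation returns a new list, so the shared text table is not mutated.
    return text_table + patch_embeddings


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def decode_token(hidden: Vector, table: List[Vector],
                 vocab_size: int) -> Tuple[str, int]:
    """Pick the best-scoring row and report whether it is text or a patch."""
    scores = [dot(hidden, row) for row in table]
    best = max(range(len(scores)), key=scores.__getitem__)
    if best < vocab_size:
        return ("text", best)
    # Indices past the text vocabulary refer to patches of the current image.
    return ("patch", best - vocab_size)


# Toy example: 2 text tokens, 2 image patches, embedding width 3.
text_table = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
patches = [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
table = expand_embedding_table(text_table, patches)

print(decode_token([0.0, 0.0, 2.0], table, vocab_size=len(text_table)))
print(decode_token([3.0, 0.0, 0.0], table, vocab_size=len(text_table)))
```

Because the patch rows are rebuilt for each image, a predicted index beyond the text vocabulary always points at a patch of the image currently being processed, which is the property the abstract attributes to VRTs. Turning the selected patches into boxes or masks is the job of the lightweight decoder and is not modeled here.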
