ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation

Lingfeng Wang, Hualing Lin, Senda Chen, Tao Wang, Changxu Cheng, Yangyang Zhong, Dong Zheng, Wuyue Zhao

TL;DR

The paper addresses the rigidity of fixed-length image tokens in multimodal language models by introducing ALTo, an adaptive-length tokenizer for autoregressive mask generation, and ALToLLM, a multimodal LLM that integrates ALTo. ALTo combines a mask tokenizer, a pixel-attentive mask de-tokenizer, and a token length predictor with differentiable token chunking to produce variable-length mask token sequences, optimized with a reconstruction loss and a length-regularization term. ALToLLM is trained via supervised fine-tuning and Group Relative Policy Optimization (GRPO) to balance mask quality and token efficiency. It achieves state-of-the-art results on gRefCOCO, RefCOCO, RefCOCOm, and open-vocabulary segmentation benchmarks while reducing average token usage from a fixed 32 tokens to around 17. The work indicates that adaptive-length tokenization can roughly halve token cost without sacrificing segmentation performance, which supports more scalable vision-language systems and suggests a path toward adaptive tokenization for RGB images and other modalities.

Abstract

While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. To bridge this gap, we propose ALTo, an adaptive-length tokenizer for autoregressive mask generation. To achieve this, we design a novel token length predictor, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences on the trade-off between mask quality and efficiency are implemented by group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at https://github.com/yayafengzi/ALToLLM.
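
To make the quality-versus-efficiency trade-off concrete, the following is a minimal sketch of how a group-relative advantage could favor shorter token sequences when mask quality is comparable. It is an illustration under stated assumptions, not the authors' implementation: the reward form (IoU minus a token-cost penalty), the `penalty` weight, and the sample values are all hypothetical. Only the within-group mean/standard-deviation normalization reflects the general GRPO recipe, and the 32-token maximum comes from the summary above.

```python
from statistics import mean, pstdev

MAX_TOKENS = 32  # fixed-length budget mentioned in the summary


def reward(iou: float, num_tokens: int, penalty: float = 0.1) -> float:
    """Hypothetical reward: mask quality minus a token-cost penalty.

    The actual reward used by ALToLLM is not given in this summary.
    """
    return iou - penalty * (num_tokens / MAX_TOKENS)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four made-up samples for one prompt: (IoU, number of mask tokens used).
group = [(0.90, 32), (0.89, 16), (0.80, 8), (0.60, 4)]
rewards = [reward(iou, n) for iou, n in group]
advantages = group_relative_advantages(rewards)

# Index of the sample that the policy update would push toward most strongly.
best = max(range(len(group)), key=lambda i: advantages[i])
print(group[best], [round(a, 3) for a in advantages])
```

With these made-up numbers, the 16-token sample receives the largest advantage because it nearly matches the IoU of the 32-token sample at half the token cost. Raising or lowering `penalty` shifts the preference toward shorter or longer sequences.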

Paper Structure

This paper contains 20 sections, 7 equations, 10 figures, 14 tables.

Figures (10)

  • Figure 1: ALToLLM realizes adaptive-length mask token generation according to object complexity.
  • Figure 2: Architecture of the proposed ALToLLM.
  • Figure 3: Training recipes for ALTo and ALToLLM. (a) ALTo Pretraining: joint training of the mask tokenizer (MT) and de-tokenizer (MD); (b) Adaptive-length Prediction: training only the token length predictor (TLP); (c) Multimodal Integration: training only the MLLM, with ALTo frozen, for language-aware adaptation; (d) Group Relative Policy Optimization: reinforcement learning for MLLM optimization. The input image in (b), (c), and (d) is processed as in (a) and is omitted for visual clarity.
  • Figure 4: Examples from the Multi-Target-SA1B dataset.
  • Figure 5: Examples from the constructed multi-class version of open-vocabulary segmentation datasets. (a) ADE20K (A-150); (b) PASCAL Context59 (PC-59); (c) PASCAL VOC 20 (PAS-20).
  • ...and 5 more figures