LlamaSeg: Image Segmentation via Autoregressive Mask Generation
Jiru Deng, Tengjin Weng, Tianyu Yang, Wenhan Luo, Zhiheng Li, Wenhao Jiang
TL;DR
This work reframes image segmentation as a visual generation task: masks are represented as discrete tokens and predicted by a LLaMA-based autoregressive model conditioned on image and language input. It introduces a VQGAN-based mask tokenizer, a data annotation pipeline that produces the SA-OVRS dataset of 2M segmentation masks with open-vocabulary labels, and a composite metric that combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD) to assess contour fidelity. The approach outperforms existing visual generative models of comparable parameter count on semantic and referring segmentation benchmarks, and it supports both training from scratch and MLLM-assisted training. By unifying semantic and language-guided segmentation in a single autoregressive framework, it enables fine-grained, text-conditioned mask generation and offers a scalable path toward universal vision models with improved cross-modal alignment.
Abstract
We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally integrates segmentation tasks into autoregressive architectures. To support large-scale training, we introduce a data annotation pipeline and construct the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or diverse textual descriptions, covering a wide spectrum of real-world scenarios. This enables our model to localize objects in images based on text prompts and to generate fine-grained masks. To more accurately evaluate the quality of masks produced by visual generative models, we further propose a composite metric that combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD), offering a more precise assessment of contour fidelity. Experimental results demonstrate that our method surpasses existing generative models across multiple datasets and yields more detailed segmentation masks.
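The two ingredients of the composite metric can be illustrated with a minimal sketch. It computes IoU over all pixels and AHD over boundary pixels of two binary masks. The abstract does not state how the two quantities are combined, so no combination is attempted here. The AHD variant used (the mean of the two directed average distances), the 4-neighbour boundary rule, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def iou(pred, gt):
    """Intersection over Union of two equally sized binary masks (lists of rows)."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match (an assumption).
    return inter / union if union else 1.0


def boundary_points(mask):
    """Foreground pixels with a 4-neighbour that is background or outside the image."""
    h, w = len(mask), len(mask[0])
    points = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                    points.append((r, c))
                    break
    return points


def directed_average_distance(a, b):
    """Mean distance from each point in a to its nearest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def average_hausdorff(pred, gt):
    """Symmetric AHD between mask boundaries (assumed variant: mean of both directions)."""
    a, b = boundary_points(pred), boundary_points(gt)
    if not a or not b:
        return 0.0 if not a and not b else math.inf
    return 0.5 * (directed_average_distance(a, b) + directed_average_distance(b, a))


# Toy example: a 2x2 square and the same square shifted one pixel to the right.
gt = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]

print(iou(pred, gt))                # 2 shared pixels out of 6 in the union
print(average_hausdorff(pred, gt))  # 0.5 pixels
```

IoU measures region overlap, while AHD measures how far the predicted contour lies from the reference contour, which is why pairing the two can be more sensitive to edge quality than IoU alone.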
