LlamaSeg: Image Segmentation via Autoregressive Mask Generation

Jiru Deng, Tengjin Weng, Tianyu Yang, Wenhan Luo, Zhiheng Li, Wenhao Jiang

TL;DR

This work reframes image segmentation as a visual generation task, treating masks as discrete tokens and predicting them with a LLaMA-based autoregressive model conditioned on image and language input. It introduces a mask tokenizer based on VQGAN and a data pipeline that produces the SA-OVRS dataset of 2M segmentation masks with open-vocabulary labels, plus a composite contour metric that combines Intersection over Union (IoU) and Average Hausdorff Distance (AHD) to assess edge fidelity. The approach outperforms existing visual generative models of comparable parameter count on semantic and referring segmentation benchmarks, and supports both training from scratch and MLLM-assisted training. By unifying semantic and language-guided segmentation in a single autoregressive framework, it enables fine-grained, text-conditioned mask generation and provides a scalable path toward universal vision models with improved cross-modal alignment.
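
The autoregressive formulation can be written as a standard next-token factorization. The notation here is an assumption made for illustration and is not taken from the paper: $m = (m_1, \dots, m_N)$ denotes the discrete mask tokens produced by the tokenizer, $x$ the image input, and $t$ the language input.

$$
p(m \mid x, t) = \prod_{i=1}^{N} p(m_i \mid m_{<i}, x, t)
$$

Under this reading, training would minimize the negative log-likelihood $-\sum_{i=1}^{N} \log p(m_i \mid m_{<i}, x, t)$ over the ground-truth mask tokens, and inference would sample or decode the tokens one at a time before passing them to the tokenizer's decoder to recover the mask.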

Abstract

We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally integrates segmentation tasks into autoregressive architectures. To support large-scale training, we introduce a data annotation pipeline and construct the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or diverse textual descriptions, covering a wide spectrum of real-world scenarios. This enables our model to localize objects in images based on text prompts and to generate fine-grained masks. To more accurately evaluate the quality of masks produced by visual generative models, we further propose a composite metric that combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD), offering a more precise assessment of contour fidelity. Experimental results demonstrate that our method surpasses existing generative models across multiple datasets and yields more detailed segmentation masks.
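
To make the two ingredients of the proposed metric concrete, the following is a minimal sketch of IoU and one common definition of Average Hausdorff Distance on binary masks. Several details are assumptions rather than the paper's specification: masks are nested lists of 0/1, contours are extracted with a 4-neighbour rule, AHD is taken as the mean of the two directed average minimum distances, and all function names are hypothetical. The abstract does not state how the two quantities are combined, so the sketch reports them separately.

```python
from math import hypot


def iou(pred, gt):
    """Intersection over Union of two equally sized binary masks."""
    inter = union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match (an assumption).
    return inter / union if union else 1.0


def boundary(mask):
    """Foreground pixels with a 4-neighbour that is background or off-image."""
    h, w = len(mask), len(mask[0])
    points = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                    points.append((r, c))
                    break
    return points


def directed_average(a, b):
    """Mean over points in a of the Euclidean distance to the nearest point in b."""
    return sum(min(hypot(p[0] - q[0], p[1] - q[1]) for q in b) for p in a) / len(a)


def average_hausdorff(pred, gt):
    """Symmetric average Hausdorff distance between the two mask contours."""
    a, b = boundary(pred), boundary(gt)
    if not a or not b:
        # Both empty: identical contours. One empty: distance is undefined, use inf.
        return float("inf") if (a or b) else 0.0
    return 0.5 * (directed_average(a, b) + directed_average(b, a))


# A 2x2 square and the same square shifted one pixel to the right.
gt = [[0, 0, 0, 0],
      [0, 1, 1, 0],
      [0, 1, 1, 0],
      [0, 0, 0, 0]]
pred = [[0, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 0]]

print(iou(pred, gt))                # 2 shared pixels / 6 in the union
print(average_hausdorff(pred, gt))  # 0.5 pixels on average
```

The brute-force nearest-point search is quadratic in the number of contour pixels, which is fine for illustration; a distance transform would be the usual choice for full-resolution masks.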

Paper Structure

This paper contains 20 sections, 3 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Different implementations of image segmentation in autoregressive frameworks. (a) Embedding as mask. (b) Coordinates as mask. (c) Text as mask. (d) Visual tokens as mask (ours).
  • Figure 2: Overall framework of LlamaSeg. Our model comprises two components: a mask tokenizer and an autoregressive model. GT labels are only used as supervision during training.
  • Figure 3: Our two-stage data generation pipeline. The first stage requires generating labels and matching them with segmentation masks, and then identifying discrete entities. The second stage generates textual data based on the labels.
  • Figure 4: Visual comparison between our model and existing visual generative models.
  • Figure 5: Attention heatmap of mask tokens in the last self-attention layer of the autoregressive model. Scores have been log-transformed to enhance visual contrast.
  • ...and 1 more figure