Pixel Aligned Language Models

Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

TL;DR

PixelLLM introduces a vision-language model that assigns a precise pixel location to each output token, enabling both captioning and dense word grounding. Trained on the Localized Narratives dataset, it uses a prompt-conditioned architecture with a lightweight per-token 2D regression head and LoRA-tuned LLMs, achieving state-of-the-art results in referring localization, dense object captioning, and location-conditioned captioning. The approach demonstrates strong gains from end-to-end dense word-pixel alignment and show-and-tell style localization integrated with text generation. This work paves the way for spatially aware language models capable of fine-grained region understanding and generation tasks.

Abstract

Large language models have achieved great success in recent years, as have their variants in vision. Existing vision-language models can describe images in natural language, answer visual questions, or perform complex reasoning about an image. However, it is still unclear how localization tasks, such as word grounding or referring localization, can be performed using large language models. In this work, we aim to develop a vision-language model that can take locations, for example, a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, generating captions for the indicated object or region. When generating locations as outputs, our model regresses pixel coordinates for each output word generated by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narratives dataset, which contains pixel-word-aligned captions derived from human attention. We show that our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM .

Paper Structure (18 sections, 6 equations, 7 figures, 5 tables)

This paper contains 18 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of the PixelLLM architecture for pixel-aligned captioning. We first encode the input location prompt (a global box prompt in this case) and the input image with the prompt encoder $\mathcal{P}$ and the image encoder $\mathcal{V}$, respectively. We then feed the prompt feature $\mathbf{l}$ and the image feature $\mathbf{f}$ into the prompt feature extractor to extract the location-specific visual feature $\mathbf{f_l}$. The large language model $\mathcal{L}$ then auto-regressively predicts the next text tokens conditioned on the previous text tokens and the visual feature. We apply a simple MLP layer to the token features before the vocabulary mapping layer of the LLM to predict the coordinates of each text token. The alignment between the caption and the trace is represented by the color gradient.
  • Figure 2: Pipeline of our model for referring expression localization. To apply PixelLLM to this task, we don't need to generate the text tokens. Instead, we directly input the query $\mathbf{t}$ into the LLM $\mathcal{L}^-$ to extract the token features before the vocabulary mapping layer. We then apply an MLP to the last token to predict the bounding boxes.
  • Figure 3: Our model for location-conditioned captioning and dense object captioning. For location-conditioned captioning, the input bounding boxes are provided. For dense object captioning, we first apply a proposal head on the image feature to generate the bounding boxes. We input bounding boxes and image features into the prompt encoder and prompt feature extractor to extract the location-specific feature for each bounding box. The language model auto-regressively predicts the caption of each object.
  • Figure 4: Qualitative results on pixel-aligned captioning (row 1), referring segmentation (row 2), and dense object captioning (row 3). The generated trace semantically corresponds to the caption, as represented by the color gradient. In referring segmentation, our model correctly understands descriptive referring expressions, e.g., "sugar powdered", "with white feathers". For dense object captioning, our model can generate region-level captions that capture spatial relationships, e.g., "shelves on the wall". Zoom in for the best view.
  • Figure 5: Qualitative results on pixel-aligned captioning. Zoom in for the best view.
  • ...and 2 more figures
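
The two regression heads described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes NumPy, random untrained weights, a two-layer MLP with a hidden size of 32, a sigmoid to keep outputs inside the image, and an `(x1, y1, x2, y2)` box layout. None of these details, nor the function names, come from the source. The token features stand in for the LLM's hidden states taken before the vocabulary mapping layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d_in, d_hidden, d_out):
    """Random weights for a two-layer MLP (hypothetical, untrained)."""
    return {
        "w1": rng.normal(0, 0.02, (d_in, d_hidden)), "b1": np.zeros(d_hidden),
        "w2": rng.normal(0, 0.02, (d_hidden, d_out)), "b2": np.zeros(d_out),
    }

def mlp(params, x):
    """Linear -> ReLU -> Linear; works on a single vector or a batch of rows."""
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return h @ params["w2"] + params["b2"]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_token_points(token_feats, params, width, height):
    """Figure 1 style: regress one (x, y) pixel location per text token.

    token_feats: (T, D) features taken before the vocabulary mapping layer.
    The sigmoid normalization is an assumption of this sketch.
    """
    normalized = sigmoid(mlp(params, token_feats))      # (T, 2), each in (0, 1)
    return normalized * np.array([width, height])       # scale to pixels

def referring_box(token_feats, params, width, height):
    """Figure 2 style: regress a single box from the last token's feature.

    The (x1, y1, x2, y2) layout is an assumption of this sketch.
    """
    normalized = sigmoid(mlp(params, token_feats[-1]))  # (4,), each in (0, 1)
    return normalized * np.array([width, height, width, height])

# Stand-in for LLM token features: 5 tokens, 16-dimensional.
T, D = 5, 16
feats = rng.normal(size=(T, D))

point_head = init_mlp(D, 32, 2)   # per-token (x, y) head
box_head = init_mlp(D, 32, 4)     # last-token box head

points = per_token_points(feats, point_head, width=640, height=480)
box = referring_box(feats, box_head, width=640, height=480)
print(points.shape, box.shape)    # (5, 2) (4,)
```

Because the point head reads the same features that feed the vocabulary layer, each generated word gets a location at no extra decoding cost. The box head needs no text generation at all, which matches the Figure 2 description of feeding the query directly into the LLM.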