
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

William Yicheng Zhu, Keren Ye, Junjie Ke, Jiahui Yu, Leonidas Guibas, Peyman Milanfar, Feng Yang

TL;DR

This work addresses zero-shot visual attribute recognition by identifying the limitations of contrastive image-text representations in modeling object–attribute dependencies. It introduces a framework that combines image-conditioned prefix language modeling with a generative retrieval approach, reframing attribute recognition as learning image–object–attribute conditional probabilities and using sentence templates to encode dependencies. The authors formalize several dependency structures, finetune via a per-class bias and scale, and report better zero-shot and finetuned performance than contrastive retrieval on VAW and VGARank, including improved handling of long-tail attributes. The results suggest that distilling knowledge from a vision-language foundation model through generative retrieval yields flexible, robust attribute reasoning with broad potential for downstream visual reasoning tasks, at the cost of higher inference complexity.

Abstract

Recognizing and disentangling visual attributes from objects is foundational to many computer vision applications. While large vision-language representations like CLIP have largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling the object-attribute relation to be measured and retrieved as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) to this reformulation and naturally distilling its knowledge of image-object-attribute relations for attribute recognition. Specifically, for each attribute to be recognized in an image, we measure the visual-conditioned probability of generating a short sentence encoding the attribute's relation to objects in the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets, Visual Attributes in the Wild (VAW) and our newly proposed Visual Genome Attribute Ranking (VGARank).
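The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `next_token_logprob` interface, the toy probability table `TOY_MODEL`, the image id `"img"`, and the template string `"{O} is {A}"` are all hypothetical stand-ins for a real image-conditioned prefixLM. The sketch only shows the idea that each candidate attribute is scored by the image-conditioned log-probability of a short sentence, accumulated token by token, so the score of an attribute depends on the words that precede it.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: log p(token | image, previous tokens).
NextTokenLogProb = Callable[[str, Tuple[str, ...], str], float]

# Toy stand-in for a pretrained prefixLM (invented numbers, illustration only).
TOY_MODEL = {
    ("img", ()): {"car": 0.9, "sky": 0.1},
    ("img", ("car",)): {"is": 1.0},
    ("img", ("sky",)): {"is": 1.0},
    ("img", ("car", "is")): {"red": 0.7, "blue": 0.2, "wet": 0.1},
    ("img", ("sky", "is")): {"blue": 0.9, "red": 0.1},
}


def toy_next_token_logprob(image: str, prefix: Tuple[str, ...], token: str) -> float:
    """Look up the toy conditional probability; unseen tokens get a tiny floor."""
    prob = TOY_MODEL.get((image, prefix), {}).get(token, 1e-9)
    return math.log(prob)


def sentence_logprob(image: str, tokens: Sequence[str],
                     next_token_logprob: NextTokenLogProb) -> float:
    """Chain rule: sum of log p(w_t | image, w_<t) over the sentence."""
    return sum(
        next_token_logprob(image, tuple(tokens[:i]), tok)
        for i, tok in enumerate(tokens)
    )


def rank_attributes(image: str, obj: str, attributes: Sequence[str], template: str,
                    next_token_logprob: NextTokenLogProb) -> List[Tuple[str, float]]:
    """Score every candidate attribute by its filled-in sentence, best first."""
    scored = []
    for attr in attributes:
        tokens = template.format(A=attr, O=obj).split()
        scored.append((attr, sentence_logprob(image, tokens, next_token_logprob)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    candidates = ["red", "blue", "wet"]
    for obj in ("car", "sky"):
        ranking = rank_attributes("img", obj, candidates, "{O} is {A}",
                                  toy_next_token_logprob)
        print(obj, [attr for attr, _ in ranking])
```

In this toy table the same image ranks "red" first for the car and "blue" first for the sky, because the attribute token is conditioned on the object token that comes before it. A scorer that ignores word order and dependency could not make that distinction from the sentence alone.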

Paper Structure (12 sections, 1 equation, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Prefix language modeling and generative prompting. During pretraining, the image-conditioned prefix language model (prefixLM) learns to generate the captions associated with images; in doing so, it accumulates knowledge and learns to reason about the object-attribute composition and dependency present in the sentence. In the downstream attribute recognition task, we propose a novel generative retrieval strategy to extract and apply the knowledge acquired from the prefixLM's large-scale pretraining. Unlike contrastive retrieval, generative retrieval models the conditional dependency in a sentence and is therefore better aligned with the actual language semantics. {A} and {O} are placeholders for attributes or objects in the sentence.
  • Figure 2: Conditional dependencies modeled by different sentence templates. Attribute recognition is modeled as a fill-in-the-blank problem for the highlighted “{A}” in the graph. Our proposed method optimizes or approximates the joint probability of observing these graph meta-modals, all while only relying on the prefixLM pre-training.
  • Figure 3: Overview of CoCa. CoCa integrates both contrastive learning and prefix language modeling. While its text decoder as a whole (Unimodal+Multimodal) learns to caption images, the first few layers (Unimodal) can be used for contrastive learning.
  • Figure 4: Zero-shot attribute prediction - qualitative results on the VAW dataset. Images are cropped using the yellow bounding boxes, and models only see the areas inside the boxes.
  • Figure 5: More qualitative examples on the VAW dataset, zero-shot vs. fine-tuned. The generative and contrastive columns use zero-shot retrieval, while the baseline column SCoNE [14] is fine-tuned on the VAW dataset.
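
The idea in the Figure 2 caption, that each sentence template models a different conditional dependency, can be made concrete with the chain rule of autoregressive language modeling. The two templates below, "{A} {O}" and "{O} is {A}", are illustrative assumptions rather than a list taken from the paper, v denotes the image, and A and O are each treated as a single token for simplicity:

```latex
\begin{align}
p(\text{``}\{A\}\ \{O\}\text{''} \mid v)
  &= p(A \mid v)\; p(O \mid A, v), \\
p(\text{``}\{O\}\ \text{is}\ \{A\}\text{''} \mid v)
  &= p(O \mid v)\; p(\text{is} \mid O, v)\; p(A \mid O, \text{is}, v).
\end{align}
```

Under these assumptions, the first template scores the attribute from the image alone and then the object given the attribute, while the second conditions the attribute on the object. Filling the same {A} slot in different templates therefore measures different dependency structures with the same pretrained model.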