ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
William Yicheng Zhu, Keren Ye, Junjie Ke, Jiahui Yu, Leonidas Guibas, Peyman Milanfar, Feng Yang
TL;DR
This work addresses zero-shot visual attribute recognition by identifying the limitations of contrastive image-text representations in modeling object–attribute dependencies. It introduces a framework that combines image-conditioned prefix language modeling with a generative retrieval approach, reframing attribute recognition as learning image–object–attribute conditional probabilities and using sentence templates to encode dependencies. The authors formalize several dependency structures, finetune via a per-class bias and scale, and demonstrate superior zero-shot and finetuned performance on VAW and VGARank, including improved handling of long-tail attributes. The results suggest that distilling knowledge from a vision-language foundation model through generative retrieval yields flexible, robust attribute reasoning with broad potential for downstream visual reasoning tasks, at the cost of higher inference complexity.
Abstract
Recognizing and disentangling visual attributes from objects is foundational to many computer vision applications. While large vision-language representations like CLIP have largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling the object-attribute relation to be measured and retrieved as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) to this reformulation, naturally distilling its knowledge of image-object-attribute relations for use in attribute recognition. Specifically, for each attribute to be recognized in an image, we measure the visual-conditioned probability of generating a short sentence encoding the attribute's relation to objects in the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets: Visual Attributes in the Wild (VAW) and our newly proposed Visual Genome Attribute Ranking (VGARank).
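The generative retrieval idea described above can be illustrated with a minimal sketch: each candidate attribute is placed into a short sentence, the sentence is scored by the sum of its image-conditioned token log-probabilities, and attributes are ranked by that score. Everything named here is an assumption for illustration only: the `"{attr} {obj}"` template, the `toy_next_token` stand-in for a prefix language model, the dictionary used as an "image", and all the numbers. It is not the authors' model or code.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: given an image and the tokens generated so far,
# return a probability distribution over the next token.
NextTokenFn = Callable[[Dict[str, float], Sequence[str]], Dict[str, float]]

# Made-up table standing in for a language model's P(object | attribute).
OBJECT_GIVEN_ATTR = {
    "red": {"apple": 0.6},
    "green": {"apple": 0.5},
    "fluffy": {"apple": 0.01},
}


def toy_next_token(image: Dict[str, float], prefix: Sequence[str]) -> Dict[str, float]:
    """Toy stand-in for an image-conditioned prefix language model."""
    if not prefix:
        # First token: attribute probabilities "read off" the image (assumed).
        return dict(image)
    # Later tokens depend on the previous token, so word order matters.
    return OBJECT_GIVEN_ATTR.get(prefix[-1], {})


def sentence_log_prob(next_token: NextTokenFn, image: Dict[str, float],
                      tokens: Sequence[str]) -> float:
    """Sum of log p(token_t | image, tokens_<t) over the sentence."""
    total = 0.0
    for t, tok in enumerate(tokens):
        dist = next_token(image, tokens[:t])
        total += math.log(dist.get(tok, 1e-12))  # small floor for unseen tokens
    return total


def rank_attributes(next_token: NextTokenFn, image: Dict[str, float], obj: str,
                    attributes: List[str], template: str = "{attr} {obj}") -> List[str]:
    """Rank candidate attributes by the log-probability of their sentence."""
    scores = {
        a: sentence_log_prob(next_token, image, template.format(attr=a, obj=obj).split())
        for a in attributes
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical "image": attribute evidence a real vision encoder would supply.
image = {"green": 0.7, "red": 0.2, "fluffy": 0.1}
ranking = rank_attributes(toy_next_token, image, "apple", ["red", "green", "fluffy"])
print(ranking)
```

Because each token is scored conditioned on the tokens before it, changing the template (for example, placing the object before the attribute) changes which dependency the score reflects. That order sensitivity is the property that distinguishes generative retrieval from globally aligned contrastive scoring.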
