Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition
Ethan Baron, Idan Tankel, Peter Tu, Guy Ben-Yosef
TL;DR
This work defines zero-shot real object classification by description, challenging CLIP-like VLMs to classify objects using descriptive attributes without object class names. It proposes a three-pronged approach: (1) generate rich, name-free attribute descriptions with LLMs, (2) fine-tune CLIP on ImageNet21k using synthetic attribute data, and (3) introduce a multi-resolution CLIP architecture to improve fine-grained part-attribute detection. The authors release name-free description benchmarks and demonstrate consistent improvements on both part-attribute recognition (PACO) and classification by description across six fine-grained datasets, with larger gains from the Columbia prompting style and the multi-resolution model. They also discuss limitations of late-fusion CLIP and suggest directions toward spatially aware models to further enhance descriptive understanding and zero-shot performance.
Abstract
In this study, we define and tackle zero-shot "real" classification by description, a novel task that evaluates the ability of Vision-Language Models (VLMs) such as CLIP to classify objects based solely on descriptive attributes, excluding object class names. This approach highlights the current limitations of VLMs in understanding intricate object descriptions, pushing these models beyond mere object recognition. To facilitate this exploration, we introduce a new challenge and release description data, with object names omitted, for six popular fine-grained benchmarks to encourage genuine zero-shot learning within the research community. Additionally, we propose a method to enhance CLIP's attribute detection capabilities through targeted training using ImageNet21k's diverse object categories, paired with rich attribute descriptions generated by large language models. Furthermore, we introduce a modified CLIP architecture that leverages multiple resolutions to improve the detection of fine-grained part attributes. Through these efforts, we broaden the understanding of part-attribute recognition in CLIP, improving its performance in fine-grained classification tasks across six popular benchmarks, as well as on the PACO dataset, a widely used benchmark for object-attribute recognition. Code is available at: https://github.com/ethanbar11/grounding_ge_public.
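
To make the task setup concrete, the following is a minimal sketch of classification by description, not the authors' implementation. It assumes that image and description embeddings have already been produced by a CLIP-like encoder (plain NumPy arrays stand in for them here), and that a class score is the mean cosine similarity between the image and that class's descriptions; this aggregation rule is an assumption. The function names `classify_by_description` and `leaks_class_name` are hypothetical.

```python
import re
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def leaks_class_name(description, class_name):
    """Return True if the description mentions the class name as a whole word.

    Hypothetical helper: a simple check that a description is name-free.
    """
    pattern = r"\b" + re.escape(class_name.lower()) + r"\b"
    return re.search(pattern, description.lower()) is not None


def classify_by_description(image_emb, class_desc_embs):
    """Pick the class whose descriptions best match the image.

    image_emb:       array of shape (d,), assumed to come from an image encoder.
    class_desc_embs: dict mapping class id -> array of shape (n_desc, d),
                     assumed to come from a text encoder applied to
                     name-free attribute descriptions.
    Scoring by mean cosine similarity is an assumption of this sketch.
    """
    img = l2_normalize(image_emb)
    scores = {}
    for class_id, desc_embs in class_desc_embs.items():
        sims = l2_normalize(desc_embs) @ img  # one similarity per description
        scores[class_id] = float(sims.mean())
    best = max(scores, key=scores.get)
    return best, scores


# Toy usage with made-up 2-D embeddings in place of real encoder outputs.
image = np.array([1.0, 0.0])
descriptions = {
    "class_0": np.array([[1.0, 0.0], [0.9, 0.1]]),
    "class_1": np.array([[0.0, 1.0], [0.1, 0.9]]),
}
predicted, class_scores = classify_by_description(image, descriptions)
```

Class ids in this sketch are opaque keys rather than names, which reflects the constraint that the text side never sees the object class name.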
