Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition

Ethan Baron, Idan Tankel, Peter Tu, Guy Ben-Yosef

TL;DR

This work defines zero-shot real object classification by description, challenging CLIP-like VLMs to classify objects using descriptive attributes without object class names. It proposes a three-pronged approach: (1) generate rich, name-free attribute descriptions with LLMs, (2) fine-tune CLIP on ImageNet21k using synthetic attribute data, and (3) introduce a multi-resolution CLIP architecture to improve fine-grained part-attribute detection. The authors release name-free description benchmarks and demonstrate consistent improvements in both part-attribute recognition (PACO) and six fine-grained datasets for classification by description, with larger gains from the Columbia prompting style and the multi-resolution model. They also discuss limitations of late-fusion CLIP and suggest directions toward spatially aware models to further enhance descriptive understanding and zero-shot performance.

Abstract

In this study, we define and tackle zero-shot "real" classification by description, a novel task that evaluates the ability of Vision-Language Models (VLMs) like CLIP to classify objects based solely on descriptive attributes, excluding object class names. This approach highlights the current limitations of VLMs in understanding intricate object descriptions, pushing these models beyond mere object recognition. To facilitate this exploration, we introduce a new challenge and release description data for six popular fine-grained benchmarks, which omit object names to encourage genuine zero-shot learning within the research community. Additionally, we propose a method to enhance CLIP's attribute detection capabilities through targeted training using ImageNet21k's diverse object categories, paired with rich attribute descriptions generated by large language models. Furthermore, we introduce a modified CLIP architecture that leverages multiple resolutions to improve the detection of fine-grained part attributes. Through these efforts, we broaden the understanding of part-attribute recognition in CLIP, improving its performance in fine-grained classification tasks across six popular benchmarks, as well as on the PACO dataset, a widely used benchmark for object-attribute recognition. Code is available at: https://github.com/ethanbar11/grounding_ge_public.
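
To make the task concrete, the following is a minimal sketch of how classification by description could be scored once image and text embeddings are available. It is not taken from the released code: the function names, the toy 3-dimensional vectors, and the choice to average cosine similarity over a class's descriptions are all assumptions made for illustration. In practice the embeddings would come from a CLIP image encoder and text encoder, and the class keys would be opaque identifiers, since the descriptions themselves contain no class names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_by_description(image_emb, class_desc_embs):
    """Pick the class whose name-free descriptions best match the image.

    class_desc_embs maps a class identifier to a list of description
    embeddings. Averaging the per-description similarities is an
    assumption of this sketch, not a detail confirmed by the abstract.
    """
    scores = {}
    for class_id, desc_embs in class_desc_embs.items():
        sims = [cosine(image_emb, d) for d in desc_embs]
        scores[class_id] = sum(sims) / len(sims)
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for encoder outputs (hypothetical values).
image_emb = [0.9, 0.1, 0.0]
class_desc_embs = {
    "class_a": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "class_b": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.6]],
}
predicted, scores = classify_by_description(image_emb, class_desc_embs)
print(predicted, scores)
```

Because the class name never enters the text side, the prediction depends entirely on how well the model matches described attributes to the image, which is the capability the task is meant to measure.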

Paper Structure

This paper contains 15 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Columbia and Oxford Style Descriptions Without Names. Examples highlight the difference between Columbia's concise, focused descriptions and Oxford's broader, narrative-driven descriptions, both omitting object names. The descriptions for each example were created using one of two styles, the Columbia style or the Oxford style. Each style is a method of prompting the LLM for descriptions (usually 8 sentences are generated per prompt style).
  • Figure 2: Real zero-shot training on ImageNet21k. This figure illustrates the process of preparing and conducting the CLIP model training for improved real zero-shot classification. The procedure begins with the selection of classes from the ImageNet21k dataset, focusing on a diverse range of object categories. For each selected class, image examples are gathered alongside the corresponding attribute descriptions generated by a Large Language Model (LLM), emphasizing the attributes of parts without including the object class names.
  • Figure 3: Our Multi-Res CLIP architecture. Multiple image slices are processed via the CLIP Vision model, and multi-resolution features are aggregated using an additional CLIP Vision layer. The original CLIP model remains frozen and only the new layer is trained.
  • Figure 4: Comprehensive view of ImageNet21k training impacts. (A) The distribution of ImageNet21k classes used in training. Plants and animals dominate because ImageNet21k contains many sub-categories of these two types. These sub-categories are beneficial for our training purposes, since LLMs can provide a rich set of essential features for them. (B) The impact of the number of ImageNet21k classes in the training set on zero-shot top-1 classification performance on the CUB dataset. (C) The impact of the number of images per class in the training set on zero-shot top-1 classification performance on the Flowers dataset.
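
The slicing step described in the Figure 3 caption can be illustrated with a small sketch. The caption does not specify how slices are chosen, so the regular grid layout, the grid sizes (1 and 2), and the function names below are assumptions made for illustration only. The sketch computes crop boxes; each crop would then be resized to the encoder's input size and passed through the frozen CLIP Vision model before the trained aggregation layer combines the resulting features.

```python
def slice_boxes(width, height, grid):
    """Return (left, upper, right, lower) boxes for a grid x grid tiling."""
    boxes = []
    for row in range(grid):
        for col in range(grid):
            left = col * width // grid
            upper = row * height // grid
            right = (col + 1) * width // grid
            lower = (row + 1) * height // grid
            boxes.append((left, upper, right, lower))
    return boxes


def multi_res_boxes(width, height, grids=(1, 2)):
    """Map each grid size to its crop boxes.

    Grid size 1 is the full image (coarse view); larger grids give
    smaller slices that cover fine-grained parts at higher effective
    resolution once resized. The grid sizes are hypothetical.
    """
    return {g: slice_boxes(width, height, g) for g in grids}


boxes = multi_res_boxes(448, 448)
for grid, grid_boxes in boxes.items():
    print(grid, grid_boxes)
```

The boxes use the same (left, upper, right, lower) convention as common image-cropping APIs, so they can be handed directly to a crop routine.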