Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits

Ada-Astrid Balauca, Danda Pani Paudel, Kristina Toutanova, Luc Van Gool

TL;DR

The proposed method (MUZE) learns to map CLIP's image embeddings to a tabular structure using a transformer-based parsing network (parseNet), which predicts missing attribute values while integrating context from the known attribute-value pairs of an input image.

Abstract

CLIP is a powerful and widely used tool for understanding images in the context of natural language descriptions to perform nuanced tasks. However, because of its generic nature, it does not offer application-specific fine-grained and structured understanding. In this work, we aim to adapt CLIP for fine-grained and structured (in the form of tabular data) visual understanding of museum exhibits. To facilitate such understanding we (a) collect, curate, and benchmark a dataset of 200K+ image-table pairs, and (b) develop a method that predicts tabular outputs for input images. Our dataset is the first of its kind in the public domain. At the same time, the proposed method is novel in leveraging CLIP's powerful representations for fine-grained and tabular understanding. The proposed method (MUZE) learns to map CLIP's image embeddings to the tabular structure by means of a transformer-based parsing network (parseNet). More specifically, parseNet enables prediction of missing attribute values while integrating context from known attribute-value pairs for an input image. We show that this leads to a significant improvement in accuracy. Through exhaustive experiments, we show the effectiveness of the proposed method on fine-grained and structured understanding of museum exhibits, achieving encouraging results on a newly established benchmark. Our dataset and source code can be found at: https://github.com/insait-institute/MUZE

Paper Structure (32 sections, 3 equations, 9 figures, 15 tables)

Figures (9)

  • Figure 1: Fine-grained examples of materials, categories, techniques, and productionDates predicted by pretrained CLIP and the proposed method (MUZE). MUZE benefits from the tabular structure of the output as well as the context provided by the other attribute-answer pairs.
  • Figure 2: Our dataset contains a variety of attributes with annotation of corresponding fine-grained labels. Some of those attributes and a subset of their possible values are highlighted here, along with sample images corresponding to a chosen class.
  • Figure 3: Quantitative analysis of the value distribution for some attributes considered suitable for classification. Each chart displays the number of images that share the most common values for the corresponding attribute. Values with counts below a threshold are aggregated in the chart under the label others. Note that different values for a given attribute don't necessarily describe disjoint sets, e.g. some objects can have both paper and ink as values for the materials attribute.
  • Figure 4: Quantitative analysis of various collected attributes across samples. Left: Histogram of per-sample count of non-empty attribute columns for the two sub-datasets; A-MUZE has a total of 18 attribute columns, while B-MUZE has 12. We also illustrate samples of images with few (2) or many (18 or 12) non-empty attributes. Right: Violin plots representing the distribution of text lengths (number of characters) for textual attributes.
  • Figure 5: Schematic representation of our proposed method (MUZE). We show the process of obtaining CLIP embeddings for the input image ($\mathsf{e}_I$), attribute names ($\mathsf{e}_{A_i}$) and attribute values ($\mathsf{e}_{V_i}$). After replacing the embeddings of the query attribute values with [MASK] tokens, we pass the resulting sequence of embeddings through parseNet to obtain the predicted embeddings for the query attributes. The CLIP Image Encoder and parseNet are trained to maximize the cosine similarity between the target and predicted embeddings.
  • ...and 4 more figures
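
The masking step described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the interleaved ordering of the sequence, the zero-vector stand-in for the [MASK] token, and the "1 minus mean cosine similarity" loss are all assumptions made for illustration, and parseNet itself (the transformer) is left out. Toy lists of floats stand in for CLIP embeddings.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_masked_sequence(
    image_emb: Sequence[float],
    attr_embs: Sequence[Sequence[float]],
    value_embs: Sequence[Sequence[float]],
    query_idx: Set[int],
    mask_emb: Sequence[float],
) -> Tuple[List[Vector], Dict[int, int]]:
    """Assemble [e_I, e_A1, e_V1, e_A2, e_V2, ...] (assumed ordering).

    Value embeddings of query attributes are replaced by mask_emb.
    Returns the sequence and a map: attribute index -> masked position.
    """
    sequence: List[Vector] = [list(image_emb)]
    mask_positions: Dict[int, int] = {}
    for i, (e_a, e_v) in enumerate(zip(attr_embs, value_embs)):
        sequence.append(list(e_a))
        if i in query_idx:
            mask_positions[i] = len(sequence)  # slot the network must fill
            sequence.append(list(mask_emb))
        else:
            sequence.append(list(e_v))  # known value kept as context
    return sequence, mask_positions


def cosine_loss(
    predicted: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
) -> float:
    """Assumed objective: 1 - mean cosine similarity over query attributes.

    Minimizing this value maximizes the cosine similarity between
    predicted and target embeddings.
    """
    sims = [cosine_similarity(p, t) for p, t in zip(predicted, targets)]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-D "embeddings"; attribute 1 is the query whose value is masked.
    seq, positions = build_masked_sequence(
        image_emb=[1.0, 0.0],
        attr_embs=[[0.0, 1.0], [1.0, 1.0]],
        value_embs=[[2.0, 0.0], [0.0, 3.0]],
        query_idx={1},
        mask_emb=[0.0, 0.0],
    )
    print(len(seq), positions)
```

In a full pipeline, the sequence would be passed through the parsing network, and the outputs at the masked positions would be compared against the CLIP embeddings of the true attribute values using a loss of this kind.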